Intel Page Modification Logging, a hardware virtualization feature:
  study and improvement for virtual machine working set estimation by Bitchebe, Stella et al.
Stella Bitchebe
Djob Mvondo
Alain Tchana
Ecole Normale Supérieure de Lyon
prenom.nom@ens-lyon.fr
Laurent Réveillère
Laboratoire Bordelais de Recherche
en Informatique
prenom.nom@labri.fr
Noël de Palma
Laboratoire d’Informatique de
Grenoble
prenom.nom@lig.fr
Abstract
Intel Page Modification Logging (PML) is a novel hardware
feature for tracking virtual machine (VM) accessed memory
pages. This task is essential in today’s data centers since
it allows, among others, checkpointing, live migration and
working set size (WSS) estimation. Relying on the Xen hy-
pervisor, this paper studies PML from three angles: power
consumption, efficiency, and performance impact on user
applications. Our findings are as follows. First, PML does
not incur any power consumption overhead. Second, PML
reduces by up to 10.18% both VM live migration and check-
pointing time. Third, PML slightly reduces by up to 0.95%
the performance degradation on applications incurred by live
migration and checkpointing. Fourth, PML however does not
allow accurate WSS estimation because read accesses are not
tracked and hot pages cannot be identified. A naive extension
of PML for addressing these limitations could lead to severe
performance degradation (up to 34.8%) for the VM whose
WSS is computed.
This paper presents Page Reference Logging (PRL), a
smart extension of PML for allowing both read and write
accesses to be tracked. It does this without impacting user
VMs. The paper also presents a WSS estimation system
which leverages PRL and shows how this algorithm can be
integrated into a data center which implements memory over-
commitment. We implement PRL and the WSS estimation
system under Gem5, a very popular hardware simulator. The
evaluation results validate the accuracy of PRL in the estima-
tion of WSS. They also show that PRL incurs no performance
degradation for user VMs.
1 Introduction
Virtualization has become the foundation of data centers as it
allows resource sharing among multiple clients while
ensuring isolation. Virtualization also provides adequate
support for administration, including resource management. In
this context, memory page tracking is a key mechanism which
allows keeping track of memory accesses by virtual machines,
so that the hypervisor can improve memory management for
different services it implements. Memory page tracking is at
the heart of several essential tasks such as checkpointing [64]
(for recovery after failure), live migration [30] (for mainte-
nance and dynamic packing) and working set size (WSS)
estimation [33] (for memory overcommitment [10] and fast
restore [65]).
The widely used approach for implementing memory page
tracking relies on present bit invalidation. Such an approach
leads to severe performance degradation (caused by the gen-
erated page faults), especially when a significant amount of
pages needs to be tracked like in WSS estimation [28, 36, 41,
46, 52, 60]. We assessed this using a synthetic application
which parses an array. The present bit is invalidated every
second for all the VM’s memory pages. We measured up to
96.22% performance degradation for a VM with 1GB of
memory. To reduce the overhead of this approach, VMware’s
WSS estimation technique applies invalidation to a small
random sample of 100 pages [60]. However, this alternative yields
an imprecise WSS estimation [52], which we also observe
in this paper (see Section 5).
Intel in collaboration with VMware [5] started in 2015 to
release processors (e.g., Broadwell Xeons) equipped with
Page-Modification Logging (PML for short) [11], a mem-
ory page tracking technology. When enabled, the Memory
Management Unit (MMU) logs (in a RAM area) any guest
physical address (GPA) which leads to the setting of the dirty
bit in the extended page table (EPT) during page walks (more
details below).
This paper presents two main contributions: (1) a deep
analysis and evaluation of the PML technology, and (2) a
smart extension of PML which addresses the issues that we
identified with PML.
(First contribution) Traditionally, new hardware evolutions
(especially for Hardware Assisted Virtualization (HAV)) lead
to investigations regarding their effectiveness compared with
previous solutions. For instance recently, the introduction of
Extended/Nested Page Table [59] led to many studies [25, 35,
62] comparing it with shadow paging [60]. In the same spirit,
we propose a deep study of PML from three perspectives:
power consumption, efficiency, and performance impact on
user applications. Our findings are summarized as follows.
First, PML incurs no power consumption overhead. Second,
(arXiv:2001.09991v1 [cs.DC], 26 Jan 2020)
PML reduces both VM live migration and checkpointing time
(by up to 10.18%). Third, PML slightly reduces (by up to
0.95%) the performance degradation on applications incurred
by live migration and checkpointing, which is not negligible
for tail latencies [39]. Fourth, PML however does not allow
accurate WSS estimation.
More precisely, the current PML design includes several
limitations which prevent effective WSS estimation. We clas-
sify them in two types:
• Hard limits, related to PML features: (1) PML does
not track every page access. It only tracks page modifications,
making it inaccurate for read-only and mixed
(read-write) workloads; (2) PML does not allow track-
ing of hot pages. It only logs a page once (even if
accessed several times), thus cold pages are likely
to be counted in the WSS, over-estimating the latter
(which leads to memory waste).
• Soft limits, related to PML overhead: PML incurs
an unacceptable overhead for the VM whose WSS
is estimated. This is caused by the fact that PML
generates VMExits which are handled by the CPUs of
the VM whose WSS is computed. The handler of that
VMExit can consume a significant CPU time, which
is taken from the VM’s CPU quota. We measured up
to 34.8% of performance degradation when PML is
activated and used for WSS estimation. This is not
acceptable for cloud users because they are not the
beneficiaries of WSS estimation. WSS estimation is
executed on the account of the data center operator.
Given these limitations, it appears that Intel developers
mainly focused on write workloads (hence the name PML),
which is already a significant contribution but does
not allow an effective implementation of WSS estimation.
WSS estimation is an essential task for data center opera-
tors because it allows, among others, memory overcommit-
ment [52, 54] (periodically adapting the VM memory size
according to its actual needs), fast restore [65] and efficient
processor cache partitioning [7]. Regarding memory over-
commitment for example, its implementation and adoption
are necessary for the following reasons. First, VM owners
tend to over-estimate the resources [32, 38] needed for their tasks. Jyothi
et al. [38] analyzed resource reservations in a 50k-node
production data center and found that 75% of jobs were
over-provisioned (even at their peak), with 20% of them
over-provisioned by more than 10×. Cortez et al. [31] made similar
observations in recent Microsoft Azure cloud traces. Second, some
types of cloud native workloads fundamentally rely on dy-
namic management of overcommitted VMs. In particular,
this is the case for serverless/FaaS (Function as a Service)
systems, whose design and pricing model are inherently tied
to aggressive packing of hundreds or thousands of micro VMs
on the same physical machine [27, 61]. Third, memory is
a limited resource whose capacity has not grown at the pace of
other resources (especially CPU), a problem that researchers refer
to as the memory wall [45, 56]. Fourth, we believe
that economic factors (driven by increased competition
between cloud providers, as well as rising costs of energy and
hardware¹, possibly amplified by higher environmental
taxes) will push providers to resort more aggressively to
resource overcommitment in production data centers in order
to achieve higher resource utilization.
(Second contribution) This paper presents Page Reference
Logging (PRL for short), a smart extension of PML for track-
ing both read and write page accesses in order to facilitate
WSS estimation. We show that a naive extension of PML
(by simply taking into account read accesses) would lead to
severe performance degradation (up to 34.8% as indicated
above). With PRL, we allow two exclusive modes: PRLPML
and PRLPAML. The former behaves like the current PML,
making PRL effective for live migration and
checkpointing. PRLPAML focuses on WSS estimation. In
PRLPAML mode, read accesses are taken into account, and
the same page can be logged several times, making it
possible to track hot pages. PRLPAML also prevents any overhead
on user VMs as follows. VMExits related to PRLPAML
are redirected to the CPUs of the privileged/controller VM. Note
that the privileged VM (noted pVM), which is present
in almost all virtualized systems (e.g., dom0 in Xen),
belongs to the data center operator. Using it to
host WSS estimation computations therefore makes sense,
since the data center operator is the main beneficiary of this
task. Basically, PRLPAML logs any GPA that is at the origin of
an EPT walk. When the PRLPAML logging buffer is full, the
current CPU (which runs the user VM whose WSS is being
computed) sends an Inter-Processor Interrupt (IPI, a new one
that we introduce) to one of the pVM’s CPUs, thus raising
a VMExit on the latter. Because VMExits related to
PRLPAML are redirected to the pVM, the user VM can continue its
execution while that VMExit is handled, unlike in PML, thus
avoiding any negative impact on user VMs. The handler of
that new IPI identifies hot pages and puts them at the disposal
of a WSS estimation system. A prototype of the latter which
leverages PRL is also presented in this paper.
In summary, the paper makes the following contributions:
• We present the first complete study of PML from
different angles. We list and assess the limitations of
PML in the task of WSS estimation, which is a critical
issue for data center operators.
• In the light of our analysis, we propose Page Reference
Logging (PRL), an extension of PML which makes it
effective for WSS estimation. Our contribution has
almost the same complexity as PML and we believe
it could be easily integrated by Intel. We prototyped
PRL in Gem5 [26], a popular hardware simulator.
• We also describe a WSS estimation system which
leverages PRL. We implemented a prototype in Xen
and Gem5.
• Using both real (HPL Linpack [17], BigDataBench [16])
and synthetic applications, we evaluated and compared
our solution with VMware’s WSS estimation
solution (which does not rely on PRL). The evaluation
results confirm that: (1) our solution is accurate, unlike
VMware’s (which generates quite random values);
(2) our solution does not impact user VMs, unlike
VMware’s (which sometimes leads to VM crashes);
(3) our solution is not intrusive (no guest OS code
modification is required), unlike other state-of-the-art
solutions [28, 36, 41, 46, 52, 60].
¹ We expect hardware manufacturing costs to increase due to various factors
such as more complex semiconductor manufacturing processes and the growing
scarcity of strategic raw materials. Accordingly, despite improvements in energy
proportionality [23], high resource utilization will be necessary to amortize
hardware acquisition costs.
We make available the entire source code of PRL, PML and
the WSS estimation system so that other researchers can
repeat or improve our work.
The remainder of the paper is as follows. Section 2 intro-
duces PML. Section 3 studies PML’s performance. Section 4
presents PRL and a WSS estimation system which leverages
PRL. Section 5 presents the evaluation results. Section 6
presents the related work. Section 7 concludes the paper.
2 Background
[Figure 1 diagram. Labels: user VM, pVM, Hypervisor, CPU, RAM, PML log,
Backup, pVM’s memory. Labeled steps: PML activation; Logging; Log full
VMEXIT / IPI; Log pre-treatment & backup; Log treatment (checkpointing,
live migration or WSS estimation); PML deactivation.]
Figure 1. Basic utilization of PML for improving a virtualization
operation (live migration, checkpointing and WSS estimation).
2.1 Page Modification Logging (PML)
PML [11] is an Intel Hardware-Assisted Virtualization (HAV)
feature for memory page tracking. It relies on Extended
Page Table (EPT) [59] memory virtualization. PML requires
specific changes in the virtual machine control structure
(VMCS)². A new 64-bit VM-execution control field called
PML address is introduced. The PML address points to a 4KB
aligned physical memory page called the PML logging buffer.
The buffer is organized in 512 64-bit entries which store
logged guest physical addresses (GPAs) (see below). A new
16-bit guest-state field called PML index is also introduced.
The PML index is the logical index of the next entry in the
logging buffer. Because the buffer includes 512 entries, the
PML index is typically a value in the range 0-511 (starting
from 511). When PML is enabled, each write instruction
which sets a dirty flag in the EPT during a page walk triggers
the logging of the GPA which is at its origin. The PML index
is decremented after each logging operation. Whenever the
PML logging buffer is full, the processor raises a VMExit
and the hypervisor comes into play. The logging process
restarts once the PML index is reset. The actions taken by the
hypervisor in response to that VMExit depend on the targeted
goal (e.g. VM live migration).
² VMCS is a control structure associated with each vCPU when a VM runs
in the Hardware-Assisted Virtualization (HAV) mode.
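To make the logging mechanism concrete, the following C sketch models the PML logging buffer and PML index in software. It is only an illustration under stated assumptions: the names pml_state, pml_reset and pml_log_write are hypothetical, and the "full" flag is a simplification that stands in for the condition under which the processor raises a VMExit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PML_ENTRIES 512u                 /* one 4KB page of 64-bit entries */

/* Hypothetical software model of the per-vCPU PML state. */
struct pml_state {
    uint64_t log[PML_ENTRIES];           /* PML logging buffer */
    uint16_t index;                      /* next entry to fill, counts 511 -> 0 */
    bool     full;                       /* simplified "buffer full" condition */
};

/* Resets the index so that logging restarts at entry 511. */
void pml_reset(struct pml_state *s)
{
    assert(s != NULL);
    s->index = PML_ENTRIES - 1;
    s->full = false;
}

/*
 * Models one write that sets an EPT dirty flag.
 * Returns true if the GPA was logged, false if the buffer was already
 * full (the point where the processor would raise a VMExit).
 */
bool pml_log_write(struct pml_state *s, uint64_t gpa)
{
    assert(s != NULL);
    if (s->full)
        return false;                    /* hypervisor must drain and reset */
    s->log[s->index] = gpa & ~0xFFFull;  /* store the 4KB-aligned GPA */
    if (s->index == 0)
        s->full = true;                  /* entry 0 was the last free slot */
    else
        s->index--;                      /* index is decremented after logging */
    return true;
}
```

In this model, 512 successful calls fill the buffer from entry 511 down to entry 0, and the next call fails until pml_reset is invoked, which mirrors the reset performed by the hypervisor.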
The next section shows how PML can be integrated into the
general process of several virtualization operations including
VM live migration, checkpointing and WSS estimation.
2.2 Typical PML utilization architecture
Fig. 1 shows the general functioning of a machine which
relies on PML for improving a virtualization operation (e.g.,
live migration). The figure shows, on the one hand, the user
VM (green), which is the target of the virtualization operation
and, on the other hand, the privileged VM (noted pVM)³, which
runs the system that implements the virtualization operation.
The execution of that
system typically begins by activating PML for the target user
VM (Fig. 1, ✪). Then the CPU of that VM can start logging
GPAs (Fig. 1, ❶). When the PML logging buffer is full, the CPU
raises a VMExit which traps into the hypervisor (Fig. 1,
❷). The handler of that VMExit performs a given task (e.g.,
copying the content of the PML logging buffer to a larger buffer
which is shared with the pVM (Fig. 1, ❸)). Then the PML
index is reset to 511 and the VM resumes (VMEnter). The
system which implements the virtualization operation (in the
pVM) periodically processes the results generated by the
log full handler (Fig. 1, ❹). This processing depends on
the virtualization operation, e.g., re-sending dirty pages in the
case of live migration. Once the virtualization operation ends,
PML is disabled (Fig. 1, ❺).
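The log full handling step described above can be sketched as follows. This is a minimal illustration, not Xen's actual handler: the shared_log structure, its size and the function handle_log_full are hypothetical names introduced here, and entries that do not fit in the shared buffer are simply dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PML_ENTRIES    512u
#define SHARED_ENTRIES (8u * PML_ENTRIES)   /* hypothetical shared buffer size */

/* Hypothetical larger buffer shared between the hypervisor and the pVM. */
struct shared_log {
    uint64_t gpa[SHARED_ENTRIES];
    size_t   count;                         /* number of valid entries */
};

/*
 * Sketch of the log full handler: copies the 512 logged GPAs into the
 * shared buffer, then resets the PML index to 511 so that logging can
 * restart when the VM resumes. Returns the number of GPAs copied.
 */
size_t handle_log_full(const uint64_t pml_log[PML_ENTRIES],
                       uint16_t *pml_index, struct shared_log *out)
{
    assert(pml_log != NULL && pml_index != NULL && out != NULL);
    assert(out->count <= SHARED_ENTRIES);

    size_t room = SHARED_ENTRIES - out->count;
    size_t n = room < PML_ENTRIES ? room : PML_ENTRIES;

    /* The buffer was filled from entry 511 down to 0: copy in logging order. */
    for (size_t i = 0; i < n; i++)
        out->gpa[out->count + i] = pml_log[PML_ENTRIES - 1 - i];
    out->count += n;

    *pml_index = PML_ENTRIES - 1;           /* reset before VMEnter */
    return n;
}
```

The system running in the pVM would then periodically consume out->gpa, for instance to decide which dirty pages must be sent again during live migration.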
3 PML study
In this section we study PML from three angles: power con-
sumption, efficiency, and performance overhead. The two
latter angles are studied under the execution of checkpointing,
live migration, and WSS estimation operations.
³ The pVM is called dom0 in Xen [21], host OS in KVM [1], parent partition
in Hyper-V [6], and Service Console in VMware [14, 15].
    int app(int wi) {                      /* wi: write intensity, in percent */
        unsigned long opType, nbOps = 0;
        tab = malloc(..);
        for (i = 0; i < D; i++) {
            for (j = 0; j < N; j++) {
                opType = random(100);
                if (opType < wi)
                    tab[j] = val;          /* write access */
                else
                    val = tab[j];          /* read access */
                nbOps++;
            }
            /* opThroughput: the number of
               operations per nanosecond */
            printf("%f\n", opThroughput);
        }
        free(tab);
    }
Figure 2. The synthetic application template. Its performance
metric is the number of operations per nanosecond.
3.1 Experimental environment
We ran the experiments on a machine with the following
characteristics: single-socket Intel(R) Core(TM) i7-3768, 16GB
memory, 500GB SSD, 4-way 64 TLB entries. We used Xen
4.7 as the hypervisor and Linux 4.15.0 for the guest kernel.
Regarding the applications which run inside the VM, we used
HPL Linpack [17], BigDataBench [16] (read, write and sort
applications, 10GB data set size), and a synthetic application.
The template of the latter is presented in Fig. 2.
It traverses an array several times. Each
array entry points to a 4KB (size of a memory page) data
structure. The operation type (read or write) performed on an
array entry is decided according to a write intensity parameter
(wi), which represents the proportion of write operations.
Unless otherwise indicated, the array uses 400MB of memory and
the VM has one vCPU and 1GB of memory.
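For readers who want to experiment with a workload of this shape, here is one possible runnable reading of the template. It is a sketch under assumptions of ours rather than the code used in the experiments: the entry type, the function name synthetic_app, the explicit n and passes parameters, the use of rand() and the timing through C11 timespec_get are all illustrative choices.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096u

/* Each entry occupies one 4KB page, so touching it touches one page. */
struct entry { unsigned char bytes[PAGE_SIZE]; };

/*
 * Runs `passes` traversals over `n` page-sized entries. `wi` is the
 * write intensity in percent (0..100). Prints the throughput of each
 * pass in operations per nanosecond and returns the total number of
 * operations, or 0 if the allocation fails.
 */
unsigned long synthetic_app(int wi, size_t n, int passes)
{
    assert(wi >= 0 && wi <= 100);
    struct entry *tab = calloc(n, sizeof *tab);
    if (tab == NULL)
        return 0;

    unsigned long nb_ops = 0;
    volatile unsigned char val = 1;      /* volatile keeps the accesses alive */

    for (int i = 0; i < passes; i++) {
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);     /* wall-clock time, for a rough figure */
        for (size_t j = 0; j < n; j++) {
            if (rand() % 100 < wi)
                tab[j].bytes[0] = val;   /* write access: dirties the page */
            else
                val = tab[j].bytes[0];   /* read access: page stays clean */
            nb_ops++;
        }
        timespec_get(&t1, TIME_UTC);
        double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        if (ns > 0)
            printf("%f\n", (double)n / ns);
    }
    free(tab);
    return nb_ops;
}
```

Calling synthetic_app(50, 102400, 10), for example, would traverse a 400MB array ten times with half of the accesses being writes.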
3.2 Power consumption
We evaluated the potential additional power consumption
incurred by PML with all benchmarks. For the synthetic
application, we vary the proportion of write operations (0%,
50%, 80%, and 100%). We use the turbostat tool in the pVM
(dom0 in Xen) to collect both CPU and memory power con-
sumption results, presented in Fig. 3. The latter only presents
results for the synthetic application, which is very represen-
tative. Power consumption incurred by PML is almost nil.
Note that in this specific experiment, the synthetic applica-
tion is CPU and memory intensive while the VMExit handler
consumes very few resources.
[Figure 3 plots: four panels, one per write proportion (0%, 50%, 80%,
100%), each showing (CPU+RAM) power in Watts (y-axis, 8 to 11.5) over
time for the noPML-x and PML-x runs.]
Figure 3. Power consumption related to PML. In (no)PML-x,
x represents the proportion of write operations performed by
the synthetic application.
[Figure 4 plots: four panels, one per write intensity (0%, 50%, 80%,
100%), each showing throughput in ops/ns over time (x-axis, 0 to 1400)
for the noPML-x and PML-x runs, plus a histogram of the perf. improvement
per write intensity with bar labels 6·10⁻² (0), 0.14 (50), 0.39 (80) and
0.95 (100).]
Figure 4. PML benefits during live migration. The first four
plots show the number of operations per nanosecond during
the execution of the application while two live migrations
are performed. We evaluated different write intensity values
(0%, 50%, 80%, and 100%). The last plot (histogram)
summarizes the improvement brought by PML.
3.3 VM migration and checkpointing
VM live migration and checkpointing share some portion
of code such as the one which performs memory saving
(static int save(...) from xc_sr_save.c in Xen).
[Figure 5 plot: histogram of the save(...) improvement per write
intensity, with bar labels 10.19 (0), 5.99 (50), 2.43 (80) and 0.98 (100).]
Figure 5. Improvement (execution time reduction) of the
static int save(...) function, which dictates the
duration of VM live migration.
PML intervenes in these two operations only during
this memory saving phase, which dictates the duration
of the operation. We compare the use of PML with
the classical memory page tracking approach, which consists
in write-protecting memory pages so that subsequent writes lead
to page faults. We use this classical approach as the baseline
for the comparison. We are interested in two metrics:
m1, the performance of the user application during
checkpointing/migration, and m2, the duration of the static int
save(...) function. m1 indicates whether PML
reduces or increases the negative impact of these operations
on the application, while m2 indicates whether PML accelerates
checkpointing/migration. To avoid noise during checkpointing,
the destination file is stored in RAM. We use the
synthetic application for this evaluation while varying the
proportion of write operations (in order to stress PML) as above.
We use this application because its behaviour is more predictable
than that of the macro-benchmarks. Two successive live
migration (respectively checkpointing) operations are performed
during the execution of the application.
Fig. 4 presents the results for m1, the performance of the
application (number of operations per nanosecond) during
two consecutive live migration operations. We can observe
that even with PML, live migration still negatively impacts the
application performance, as illustrated by the two dips
in all curves of Fig. 4. However, PML
slightly reduces this impact, by 0.065%-0.95%. The size of the
reduction depends on the write intensity of the workload.
The reader can see the circled zone in Fig. 4 as an example:
the PML-100 curve (meaning PML is enabled) lies above
the noPML-100 curve (the classical solution). The improvement
increases with the write intensity: compare the read-only
(noPML-0 and PML-0) with the write-only (noPML-100 and
PML-100) results in Fig. 4.
Fig. 5 focuses on m2, the execution time of static
int save(...), which affects the migration time. We
can see that PML reduces m2 during live migration by 0.98%-10.18%.
In particular, read-intensive applications migrate much
faster when PML is used. Accelerating live migration is
important because it makes it possible to quickly free
a machine for maintenance, quarantine a corrupted
VM, etc.
In contrast, we observed that PML does not actually im-
prove checkpointing. This is because VM checkpointing
suspends the execution of the VM, which is not the case dur-
ing live migration. We think that live checkpointing would
likely take advantage of PML.
3.4 WSS estimation
Building on our experience, we implemented in Xen a WSS
estimation prototype which relies on PML. This WSS algo-
rithm is summarized in Section 4.4. Briefly, it consists in
activating PML on the VM. Then, logged GPAs are collected
and the WSS is computed as follows. The working set of the
VM is reached when no new GPAs are seen in the buffer log.
The WSS is the total number of distinct collected GPAs.
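The estimation loop described above can be sketched in C as follows. This is an illustration only, not the Xen prototype: the wss_tracker structure and the wss_init, wss_feed and wss_destroy functions are hypothetical names, and the tracker uses one bit per guest page frame to detect GPAs that were already seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12                      /* 4KB pages */

/* Hypothetical tracker: one bit per guest page frame. */
struct wss_tracker {
    uint8_t *seen;                         /* bitmap indexed by frame number */
    size_t   nr_pages;                     /* number of guest page frames */
    size_t   distinct;                     /* distinct pages seen so far */
};

bool wss_init(struct wss_tracker *t, size_t nr_pages)
{
    assert(t != NULL);
    t->seen = calloc((nr_pages + 7) / 8, 1);
    t->nr_pages = nr_pages;
    t->distinct = 0;
    return t->seen != NULL;
}

void wss_destroy(struct wss_tracker *t)
{
    assert(t != NULL);
    free(t->seen);
    t->seen = NULL;
}

/*
 * Feeds one batch of logged GPAs (e.g. one drained log buffer) and
 * returns how many of them refer to pages not seen before. A batch
 * that returns 0 corresponds to the stop condition: no new GPAs.
 */
size_t wss_feed(struct wss_tracker *t, const uint64_t *gpas, size_t n)
{
    assert(t != NULL && t->seen != NULL);
    size_t fresh = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t gfn = gpas[i] >> PAGE_SHIFT;
        if (gfn >= t->nr_pages)
            continue;                      /* ignore out-of-range addresses */
        uint8_t mask = (uint8_t)(1u << (gfn % 8));
        if (!(t->seen[gfn / 8] & mask)) {
            t->seen[gfn / 8] |= mask;
            fresh++;
        }
    }
    t->distinct += fresh;
    return fresh;
}
```

Once wss_feed returns 0 for a batch, the estimate in pages is t->distinct, and the estimate in bytes is that value multiplied by 4096.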
However, we concluded that it is not possible to build
an accurate WSS estimation system using the current PML design. We
identified three limitations (noted Li) of the latter. Two of
them are hard limitations, which make PML unable to accurately
estimate a VM’s WSS. The third is a soft limitation which
makes PML unfair for cloud users and their VMs.
(L1) - hard PML does not log all accessed pages. In fact
PML only logs GPAs which are at the origin of the dirty bit
setting (hence the name PML). It means that only page mod-
ifications are recorded. However, the WSS of a VM should
include both read and write accesses. We experimentally
assessed this PML limitation using the synthetic application
presented above (Fig. 2), in which we eliminate the first loop.
We observed for all executions that the estimated WSS
always matched the written portion of the array, while the correct
WSS is the array size. Not counting read accesses within the WSS leads
to under-estimating the memory needs, and thus to performance degradation
(possibly a crash) for the VM.
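As a back-of-the-envelope reading of this observation, assume (this is our simplification, not a measurement) that each of the N array entries is accessed exactly once and is written independently with probability wi/100. The expected number of pages logged by PML is then:

```latex
\mathbb{E}\left[\mathrm{WSS}_{\mathrm{PML}}\right] = \frac{wi}{100}\, N
\qquad \text{whereas} \qquad
\mathrm{WSS}_{\mathrm{actual}} = N .
```

For instance, with a 400MB array (N = 102,400 pages of 4KB) and wi = 50, PML would report about 51,200 pages (200MB), i.e., half of the actual working set.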
(L2) - hard Using PML, it is not possible to track only hot
pages, which are the relevant ones for WSS estimation. A
page is said to be "hot" if it is referenced several times during a
short period of time. With the current PML design,
an accessed page is logged only once. Using this design for
WSS estimation, it is not possible to distinguish "hot" from
"cold" pages, resulting in the over-estimation of the actual
VM memory needs. To assess this limitation, we added to
the synthetic application a for loop at the beginning which
modifies all the array entries. The remaining application code
works on a small portion of the array (noted M, M < N). In
this case, M is the correct WSS. Using PML, N is reported as
the WSS, leading to memory waste.
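The difference between counting logged pages and counting hot pages can be illustrated with the short C sketch below. It assumes a log in which a page may appear several times, which the current PML design does not provide; the function name count_hot_pages and the hot_threshold parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4KB pages */

/*
 * Counts references per page over one observation window and returns
 * the number of pages referenced at least `hot_threshold` times.
 * `counts` must hold `nr_pages` counters, all zero on entry.
 */
size_t count_hot_pages(const uint64_t *gpas, size_t n,
                       uint32_t *counts, size_t nr_pages,
                       uint32_t hot_threshold)
{
    assert(gpas != NULL || n == 0);
    assert(counts != NULL && hot_threshold > 0);

    size_t hot = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t gfn = gpas[i] >> PAGE_SHIFT;
        if (gfn >= nr_pages)
            continue;                      /* ignore out-of-range addresses */
        if (counts[gfn] == UINT32_MAX)
            continue;                      /* counter saturated */
        if (++counts[gfn] == hot_threshold)
            hot++;                         /* page just became hot */
    }
    return hot;
}
```

With a trace in which all N pages appear once and only M of them appear repeatedly, a threshold of 2 yields M hot pages, whereas counting distinct logged pages yields N.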
This current PML design is sufficient for live migration and
checkpointing because these two operations are only inter-
ested in tracking modified pages. (Counting the number of
modifications for a given page is not necessary).
(L3) - soft The handling of PML logging buffer full events
should not be done by the CPU of the VM whose WSS is
estimated. In fact, depriving the user VM of its CPU quota is
unfair because WSS estimation is only beneficial for the data
center operator. The reader could legitimately object that this
limitation also applies to checkpointing and migration. However,
checkpointing is done while the VM is suspended, and it
has been shown [53] that a slight reduction of the CPU time
used by the migrated VM accelerates live migration.
We measured the overhead of using the current PML design to
estimate the WSS of applications from BigDataBench [16]
(read, write and sort) and HPL Linpack [17] (detailed in
Section 5). To this end, we ran each application with and
without PML and computed the overhead, presented in Fig. 6.
We can see that read-intensive applications are not impacted
by PML because they do not trigger PML logging. However, other
workloads are impacted by PML, by up to 34.8% for HPL
Linpack.
[Figure 6 plot: histogram of the overhead (%) per application, with bar
labels 1·10⁻³ (read), 4.8 (sort), 7.9 (write) and 34.9 (HPL).]
Figure 6. The impact of PML when it is used for WSS
estimation. For BigDataBench applications, the dataset is 10GB.
We observe up to 34.8% of performance degradation
for HPL Linpack.
3.5 Synthesis
The main conclusions of this study are as follows. First, PML
incurs almost no power consumption overhead. Second, PML
reduces VM live migration time. Third, live checkpointing
could benefit from PML. Fourth, PML slightly reduces
the performance overhead on applications during live
migration. Fifth, PML does not allow accurate WSS estimation.
Working set estimation is useful for a variety of tasks in
the data center, including:
• fast VM restore, after checkpointing. As shown by [65],
VM restore can be accelerated if only the working set
of the VM is restored, instead of the entire VM mem-
ory.
• optimal processor cache partitioning. Intel Cache
Allocation Technology is a hardware feature which
allows partitioning and isolating the last level processor
cache. Determining the cache size for each
VM is a very tricky task. [7] showed that the WSS of
a VM is the optimal value for its cache size.
• optimal resource utilization using memory
overcommitment. Resource waste is one of the main challenges
in today’s data centers. Memory overcommitment [10],
which is the capability to dynamically adjust each VM’s
memory size according to its actual needs, has been
demonstrated by several research works [54] to be a
very promising approach for reducing resource waste.
The next section presents an extension of PML which makes
it effective for WSS estimation.
4 Page Reference Logging
This section presents Page Reference Logging (PRL for short),
an extension of PML for making the latter usable for WSS
estimation. Conceptually, PRL includes two innovations:
• the capability to track both read and write accesses;
• the redirection of log full events to pVM’s CPUs, in-
stead of user VMs.
[Figure 7 diagram. Labels: VM1, VM2, pVM, CPU (one per VM), Hypervisor,
RAM, PRL log (two instances), Backup, pVM’s memory. Labeled steps: PRL
activation; Logging; Log full VMEXIT/IPI; Log pre-treatment & backup;
Log treatment; PRL index reset; PRL deactivation.]
Figure 7. PRL, the design that we propose. It works in
two exclusive modes: PRL_PML and PRL_PAML. Only the latter
mode is presented in this figure. It allows accurate WSS
estimation without impacting user VMs.
4.1 Design
PRL allows the processor to function in two exclusive modes:
PRL_PML and PRL_PAML (PAML stands for Page Access and
Modification Logging). In PRL_PML, the processor works
in the same way as the current PML design, thus allowing
PRL to satisfy live migration and checkpointing requirements.
PRL_PAML is enabled in the same way as PRL_PML, with the sole
difference that the system software should set a new bit of the
Secondary Processor-Based VM-Execution Controls (e.g., bit 26).
This section focuses on the description of PRL_PAML (see Fig. 7),
which tackles all requirements related to WSS estimation.
First, a new 16-bit host-state field called "log full handler
CPU" indicates the index of the CPU to which an interrupt is
sent when the PRL log buffer is full. A new 8-bit host-state
field called "log full vector" indicates the interrupt vector
which will be executed by the target CPU on log full interrupt
reception. This destination CPU should belong to the pVM so
that the latter serves as the execution environment of the
WSS estimation system (Fig. 7), presented in Section 4.4.
In this way, PRL_PAML avoids scheduling out the VM whose WSS
is being estimated. Recall that the pVM belongs to the data
center operator, thus its utilization for WSS estimation
makes sense.
Second, for every GPA which is input to the PRL process
(which starts after an EPT walk, on a TLB miss), the following
algorithm takes place:
1. If all bits of the PRL index are zero, the PRL logging
buffer is full. The PRL index is then decremented and an
interrupt is sent to the pVM's CPU which is responsible
for handling log full events. The PRL process ends
without interrupting the VM whose WSS is being estimated.
The processor starts logging again once the PRL index has
been reset by the system software, namely the log full
event handler (see Sections 4.2 and 4.3).
2. If the PRL index is negative, the log full handler is
executing. The GPA is not logged and the PRL process
ends. This means that PRL misses some GPAs while the
handler executes. We claim that this does not affect
the estimation of the WSS: if a missed GPA belongs to
the working set, it is likely to be seen again in the
near future (after the PRL mechanism is re-enabled)
since it is hot. Otherwise, the GPA is cold and its
loss is not an issue. The evaluation results confirm
our claim.
3. Otherwise, the PRL index is decremented and the GPA is
logged. This is done regardless of the value of the dirty
flag. In this way, PRL_PAML can log both accessed
and modified pages. In addition, PRL_PAML can log
the same page access several times.
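The three cases above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the class name PrlLogger, the method names, and the reset value of 511 for the PRL index are hypothetical, and the logging buffer is modelled as a plain Python list.

```python
class PrlLogger:
    """Software model of PRL_PAML logging for one vCPU (illustrative only)."""

    def __init__(self, initial_index=511):
        self.initial_index = initial_index  # assumed reset value of the PRL index
        self.index = initial_index
        self.log = []        # logged GPAs, duplicates allowed
        self.ipis_sent = 0   # log full interrupts sent to the pVM's CPU
        self.missed = 0      # GPAs dropped while the handler is running

    def on_gpa(self, gpa):
        """Apply the three-step rule to one GPA produced by an EPT walk."""
        if self.index == 0:          # step 1: buffer full
            self.index -= 1          # the index becomes negative
            self.ipis_sent += 1      # notify the pVM; the user VM keeps running
            return "ipi"
        if self.index < 0:           # step 2: the handler has not reset the index yet
            self.missed += 1
            return "missed"
        self.index -= 1              # step 3: log, whatever the dirty flag says
        self.log.append(gpa)
        return "logged"

    def reset(self):
        """Model of the index reset performed by the log full handler."""
        self.log.clear()
        self.index = self.initial_index
```

With initial_index=3, the first three GPAs are logged, the fourth one triggers the interrupt, and every later GPA is counted as missed until reset() is called.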
4.2 Full event redirection to pVM’s CPUs
To redirect a log full event, the current processor sends
(through its LAPIC) an IPI (Inter-Processor Interrupt) to the
pVM's CPU which has been designated at VMCS configuration time.
The LAPIC is configured with the identifier of that target CPU.
We have introduced a new interrupt vector which points to the
full event handler. This mechanism is similar to "Lightweight
inter-core notifications" introduced by Jeffrey C. Mogul et al.
in [49].
Since the full event handler should run in VMX root mode
because it deals with VMCS data structures, the target
processor must trigger a VMExit upon receiving the IPI.
This behavior is enforced by setting bit 0 of the Pin-Based
VM-Execution Controls of all the pVM's vCPUs. With this
configuration, any external interrupt sent to any CPU which
runs the pVM triggers a VMExit.
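The control bits involved can be summarised with a few bit masks. This is a sketch under assumptions: the constant and function names are ours, bit 17 is the existing "enable PML" bit of the secondary controls in the Intel SDM, bit 26 is only the example position given in Section 4.1, and we read "enabled in the same way as PRL_PML" as meaning that the PML bit is set in both modes.

```python
PIN_EXTERNAL_INTERRUPT_EXITING = 1 << 0  # pin-based controls, bit 0
SECONDARY_ENABLE_PML = 1 << 17           # existing "enable PML" bit
SECONDARY_ENABLE_PAML = 1 << 26          # hypothetical bit ("e.g., bit 26")


def secondary_controls(current, paml):
    """Return the secondary controls with PRL enabled in the requested mode."""
    value = current | SECONDARY_ENABLE_PML
    if paml:
        value |= SECONDARY_ENABLE_PAML    # PRL_PAML: reads and writes are logged
    else:
        value &= ~SECONDARY_ENABLE_PAML   # PRL_PML: behaves like current PML
    return value


def pvm_pin_controls(current):
    """Make every external interrupt, including the log full IPI, cause a VMExit."""
    return current | PIN_EXTERNAL_INTERRUPT_EXITING
```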
4.3 Full event handling on pVM’s CPUs
The basic actions that the handler of the log full event should
implement are as follows. First of all, it masks the interrupt
related to the log full event. It then transfers the content
of all PRL logging buffers which are full (for all VMs) to
larger buffers. Notice that each VM is assigned a dedicated
large buffer, which is unique even for a multi-vCPU VM.
During this transfer, the handler accumulates for each GPA
the number of times it has been seen in the PRL logging
buffer. This way, the WSS estimation system (see the next
section) can determine hot pages. After the transfer of
the PRL logging buffers, the handler should reset (to the
initial value) the PRL index of all VMCSs which were detected
as full. Notice that the modification of a PRL index only
concerns its VMCS memory region, not the corresponding
processor internal register. In fact, the synchronization of
the VMCS memory region and the processor registers is not
automatic. To enforce it, we introduce a new instruction
which refreshes the internal VMCS state of a specific
processor from its corresponding VMCS memory region.
The execution of the handler should end with the unmasking of
the log full interrupt.
Notice that this algorithm handles several log full events
with only one generated interrupt. The algorithm is
inspired by the New API (NAPI) implemented in modern
Linux kernels for handling network packet reception [57].
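The handler logic can be sketched as follows. This is an illustrative model, not the implementation: a VMCS is represented by a dictionary with the hypothetical keys "vm", "index" and "log", the per-VM large buffer is a Counter, and the new refresh instruction is reduced to a "needs_refresh" flag.

```python
from collections import Counter


def handle_log_full(vmcs_list, big_buffers, initial_index=511):
    """Drain every full PRL buffer in one pass and return how many were drained.

    Each entry of vmcs_list is a dict with the keys "vm", "index" and "log"
    (a hypothetical stand-in for a VMCS and its PRL logging buffer).
    big_buffers maps a VM name to a Counter of GPA -> number of times seen.
    """
    # The real handler would mask the log full interrupt here.
    drained = 0
    for vmcs in vmcs_list:
        if vmcs["index"] >= 0:               # buffer not full: skip it
            continue
        counts = big_buffers.setdefault(vmcs["vm"], Counter())
        counts.update(vmcs["log"])           # accumulate per-GPA occurrence counts
        vmcs["log"].clear()
        vmcs["index"] = initial_index        # reset in the VMCS memory region
        vmcs["needs_refresh"] = True         # stands in for the refresh instruction
        drained += 1
    # The real handler would unmask the log full interrupt here.
    return drained
```

Because the loop visits every VMCS, one interrupt is enough to drain all the buffers that filled up in the meantime, which is the NAPI-like behavior described above. The two vCPUs of a same VM share one Counter, so a GPA touched by both is counted once per occurrence.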
4.4 A PRL-based WSS estimation system
Memory overcommitment [60] consists in dynamically adjusting
the amount of memory allocated to a VM according to
its actual needs, thus avoiding waste. This on-demand memory
management strategy requires the data center operator to
continuously estimate, for each VM, its current WSS (noted M).
The latter can then serve as input to several optimization
tasks in the data center, see the previous section. This section
presents a WSS estimation system which leverages PRL. This
system runs inside the pVM.
A PRL-based WSS estimation system launches as many
WSS estimation processes as the number of tenant VMs. Each
process calculates M using this equation
M = wss × pageSize + ε (1)
where wss is the number of hot pages, pageSize is the size of
a memory page, and ε is the size of the guest kernel footprint.
wss computation The value of wss is computed using GPAs
logged by PRL. The computation algorithm takes three
parameters as input:
τ: a page whose GPA has been logged at least τ times is
considered as a hot page;
ω: the WSS stability duration, used to determine whether
the VM has already covered its working set;
µ: the observation interval.
The values of these parameters are decided by the external
entity which launches the WSS estimation system. [67] presents
a set of methods which can be used to determine the values
of these parameters.
Figure 8. Illustration of the WSS estimation algorithm.
    int app(void) {
        for (int i = 0; i < D; i++)        /* D: duration of the run */
            for (int j = 0; j < N; j++)    /* N: number of array entries */
                operation(tab[j]);
        return 0;
    }
Figure 9. The second synthetic application template.
The estimation algorithm works as follows. Let @buff denote
the address of the cumulative PRL buffer of the VM
whose WSS is calculated. The WSS estimation algorithm
consists of a while loop. At each iteration i, the number of
distinct GPAs present in @buff which have been logged at
least τ times is computed and stored in dist[i]. The loop ends
when dist[i] − dist[i − ω] = 0, with ω counted in iterations
(i.e., ω/µ). This condition is true when the
VM has touched/referenced all memory pages belonging to
its current working set. Otherwise, the process sleeps for
µ seconds before continuing with the next iteration. Fig. 8
illustrates this algorithm. We can see that the evolution of
dist[i] over time corresponds to a monotonically increasing
function.
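The loop, combined with Equation (1), can be sketched as below. It is a minimal model under assumptions: the function name, the read_counts callback (which returns a snapshot of the cumulative buffer as a mapping GPA -> count), the injectable sleep function and the default 4KB page size are ours, and ω is converted to a number of iterations as ω/µ.

```python
import time


def estimate_wss(read_counts, tau, omega, mu,
                 page_size=4096, epsilon=0, sleep=time.sleep):
    """Return M = wss * page_size + epsilon once dist is stable for omega seconds."""
    window = max(1, round(omega / mu))     # omega expressed in iterations
    dist = []
    while True:
        counts = read_counts()             # snapshot of the cumulative PRL buffer
        # A page is hot when its GPA has been logged at least tau times.
        dist.append(sum(1 for c in counts.values() if c >= tau))
        i = len(dist) - 1
        if i >= window and dist[i] - dist[i - window] == 0:
            return dist[i] * page_size + epsilon
        sleep(mu)                          # wait for the next observation
```

With the configuration used in the evaluation (µ = 30s, ω = 120s), the window is four iterations, so the loop needs at least five observations before it can stop.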
ε computation The value of ε depends on the guest kernel
binary. It is estimated once by the data center operator for
each kernel binary using the following algorithm:
1. Start a 2GB VM from the kernel binary;
2. Initialize ε and curMem (an auxiliary variable) to
2GB;
3. Set curMem to 95% × ε;
4. Change the VM memory size to curMem;
5. If the VM crashes then stop the algorithm and return
ε;
6. Else set ε to curMem and go to step 3.
We provide a tool that automates these steps for a machine
virtualized with the Xen hypervisor.
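These steps amount to a geometric search, sketched below. The sketch is not the tool mentioned above: the function name and the vm_survives callback, which stands for "resize the VM to this many MB and report whether it keeps running", are hypothetical.

```python
def estimate_epsilon(vm_survives, start_mb=2048.0, factor=0.95):
    """Shrink the VM by 5% at a time; return the last size (MB) that did not crash.

    vm_survives(mem_mb) is a hypothetical callback that resizes the VM and
    returns False if the VM crashes at that size.
    """
    epsilon = start_mb                  # steps 1 and 2: start from a 2GB VM
    while True:
        cur_mem = factor * epsilon      # step 3
        if not vm_survives(cur_mem):    # steps 4 and 5
            return epsilon
        epsilon = cur_mem               # step 6
```

Since each step removes 5% of the current size, the returned value overestimates the true footprint by at most a factor of 1/0.95.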
5 PRL Evaluation
This section presents the evaluation results of PRL and the
WSS estimation system which relies on it. The evaluations
cover two aspects: accuracy and overhead. The former is
the capability of PRL to accurately estimate the WSS of a
VM. The WSS estimation algorithm is configured as follows:
τ = 50, µ = 30s, and ω = 120s (i.e., 4 × µ). The values of these
parameters were found empirically. Concerning the overhead,
we evaluated: (1) the additional power consumption incurred
by PRL, (2) the impact incurred by PRL on the VM whose
WSS is estimated (referred to as "the target VM"), and (3)
the amount of resources consumed by the WSS estimation
system inside the pVM.
5.1 Experimental environment
The experimental environment is the same as the one presented
in Section 3.1 (which uses real hardware), complemented
with the hardware simulator Gem5 [26]. The combination
of these two environments allows us to emulate a machine
which implements PRL, see below. We chose Gem5 because
it is a very popular hardware simulator, which has been used
by up to 97 research papers at the time of writing
(according to [2]). Gem5 allows the execution of a full
Linux distribution. We extended it in order to simulate a
virtualized system. This extension consists of: (1) the
addition of the Extended Page Table (EPT), (2) the extension
of the hardware page table walker so that it performs a 2D
page walk through the EPT and (3) the implementation of
the PRL/PML logging mechanisms. The emulation methodology
is as follows. The application is first executed under Gem5
and the logged GPAs are collected, together with the logging
instants given by the simulator. The collected traces are then
replayed inside the dom0 of a real virtualized environment
which runs the WSS estimation system. A dom0 CPU (noted
CPU0) is dedicated to the latter. The other dom0 CPUs run
processes which replay the traces, thus mimicking the
functioning of a real PRL-equipped machine. Each process which
replays the traces sends an IPI to CPU0 every time it has played
N traces, N being the size of the logging buffer. This emulation
approximates the real functioning of a PRL-capable machine,
depicted in Fig. 7. In order to be fair, we also evaluated the
accuracy of PML and VMware's solution following the same
methodology. Recall that VMware's WSS estimation solution
consists in periodically (every 30s) selecting a sample of
100 memory pages whose present bits are invalidated. The
proportion of pages, among these 100 selected pages, which
lead to a page fault represents the WSS of the VM.
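The sampling approach recalled above can be sketched as follows. This is our simplified reading, not VMware's implementation: the function name is hypothetical, the set touched stands for the pages that fault during the observation period, and scaling the faulting proportion by the total number of pages is an assumption about how the proportion is turned into a size.

```python
import random


def sampled_wss_pages(total_pages, touched, sample_size=100, rng=random):
    """Estimate the WSS (in pages) from the fraction of sampled pages that fault."""
    sample = rng.sample(range(total_pages), sample_size)  # pages to invalidate
    faults = sum(1 for page in sample if page in touched)
    return faults / sample_size * total_pages
```

With only 100 sampled pages, the estimate can only move in steps of 1% of the VM memory, which gives an intuition for the coarse results reported for this approach.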
Concerning the application which runs inside the VMs, we
also use a variant of the synthetic application presented in
Section 3.1. This second variant follows the template shown
in Fig. 9. N controls the WSS (N × 4KB), D represents
the duration of the application, and operation(tab[j]) is
the operation performed on the array entry. We consider three
workload types, with respect to operation(tab[j]): (RWRW)
every read is followed by a write; (RRWW) a set of reads
followed by a set of writes; and (WWRR) a set of writes followed
by a set of reads.
[Figure 10 plots: four panels (RWRW workload, RRWW workload,
WWRR workload, HPL Linpack), each showing WSS (MB) over Time (sec)
for PML, PRL and VMware; the three synthetic workload panels also
show the Expected value.]
Figure 10. Accuracy of PRL compared with PML and VMware.
5.2 Accuracy
Given a constant workload which runs inside the VM, we
are interested in verifying whether PRL is able to accurately
estimate the WSS (the number of hot pages) of that workload.
To carry out this evaluation, we used the second variant of the
synthetic application and HPL Linpack. We did not use
BigDataBench applications because their execution under Gem5
never completed: Gem5 is a very slow simulator, and thus
not suitable for such big data applications.
Fig. 10 presents the results, interpreted as follows. Concerning
the synthetic application, its WSS is known in advance
(400 MB). We can see that PRL is accurate for all workload
types, with an error margin lower than 1MB. VMware's
solution appears inaccurate, which is in line with previous
research observations [54]. The accuracy of PML depends
on the amount of write operations. It accurately estimates
the RWRW and WWRR WSS because during their execution,
all the array entries are referenced by write operations. This
is not true during the first step of RRWW, which explains the
inaccuracy of PML.
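The difference between the three workloads can be reproduced with a toy trace generator. This is a sketch under assumptions: the function names are ours, one "set" of reads or writes is modelled as a single pass over all N pages, and a PML-like tracker is reduced to "ignores reads".

```python
def trace(kind, n):
    """Return one pass over n pages as a list of (op, page) pairs."""
    reads = [("R", p) for p in range(n)]
    writes = [("W", p) for p in range(n)]
    if kind == "RWRW":                       # every read is followed by a write
        return [access for pair in zip(reads, writes) for access in pair]
    if kind == "RRWW":                       # all reads, then all writes
        return reads + writes
    if kind == "WWRR":                       # all writes, then all reads
        return writes + reads
    raise ValueError(kind)


def pages_seen(accesses, track_reads):
    """Distinct pages a tracker observes; a PML-like tracker ignores reads."""
    return len({page for op, page in accesses if track_reads or op == "W"})
```

For example, over the first half of trace("RRWW", 4), pages_seen returns 0 with track_reads=False and 4 with track_reads=True, whereas over a full pass both trackers see all four pages for every workload type.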
About HPL, we observed that both PRL and PML are
accurate, unlike VMware. Since the WSS is not known in
advance, we validated the accuracy with the following protocol.
Relying on the results obtained with PRL, we dynamically
adjusted the VM memory size at runtime. We observed no
VM crash and no performance degradation, meaning that
PRL has either overestimated the WSS or estimated it accurately.
We repeated the experiment while subtracting 100MB from the
values generated by PRL (1700 MB), which led to a VM crash.
We conclude that the values obtained with PRL were accurate.
During these experiments, we observed that the number
of missed GPAs during the handling of full buffer events is
negligible. Also, a GPA which is missed at round i is seen
in round i + x when that GPA is part of the working set. For
illustration, Table 1 shows the number of missed GPAs during
the execution of a synthetic application and HPL. We also
observed (with two buffer sizes in Table 1) that the size of the
logging buffer does not impact this result, even if the buffer
processing is longer.
After demonstrating the accuracy of PRL, we used it for
estimating the WSS of PARSEC applications [12]. The latter
are often used by researchers, so their WSS is likely to be
of interest to the community. Fig. 11 presents the
results. We can see that PML is accurate for six applications
(streamcluster, vips, fluidanimate, blackscholes, bodytrack,
dedup). VMware's WSS estimation solution leads to
inaccurate values. By making the implementation of PRL under
Gem5 publicly available, we allow researchers to use it for
estimating the WSS of other benchmarks.
5.3 Overhead
5.3.1 Overhead on the VM whose WSS is computed
This overhead is the total number of CPU cycles used by the PRL
circuitry (noted T^{Circuitry}_{PRL}). As no PRL-capable machine
exists yet, we assume that T^{Circuitry}_{PRL} equals
T^{Circuitry}_{PML}, given the slight difference between the two
modes. Therefore, to evaluate the overhead of PRL, we used a
PML-capable machine on which the VM runs a read-only application.
This scenario avoids VMExits related to PML, thus reproducing the
effect of PRL on the VM (since VMExits in PRL are executed on
a different core). We repeated this experiment while varying
the number of tenant VMs in order to increase the pressure on
the TLB. Recall that PML (as well as PRL) takes place during
the page table walk, on each TLB miss. We are interested in
the performance difference of the application with and without
PML. We measured no performance difference (about
0.001%), meaning that PRL will likely incur no performance
degradation on the VM whose WSS is computed.
5.3.2 Resource utilization in the pVM
We focus on CPU consumption. To this end, we used the
emulated environment. A CPU core is dedicated to the WSS
estimation system. The write intensive synthetic application is
used for this experiment because it is the worst case. We
performed the experiment with one VM. Fig. 12 top presents the
percentage of CPU consumed by the WSS estimation process
in the dom0 for a single-vCPU VM. Only a representative
portion of the results is presented. The CPU consumed during the
treatment of the buffer full event (computation of dist[i], see
Section 4.4) increases (with the size of the cumulative buffer)
until the WSS is discovered. However, we can observe that
the CPU is idle most of the time, waiting for the PRL logging
buffer full event. On average, the total CPU time consumed
by the WSS estimation process for a single-vCPU VM is less
than 1.5%. We repeated the same experiment while varying
the number of single-vCPU VMs. Fig. 12 bottom shows
that the average percentage of CPU consumed by the WSS
estimation system increases linearly.
[Figure 11 plots: twelve panels (blackscholes, bodytrack, canneal,
dedup, facesim, ferret, fluidanimate, freqmine, streamcluster,
swaptions, vips, x264), each showing WSS (MB) over Time for PML,
PRL and VMware.]
Figure 11. WSS estimation of PARSEC applications using PRL.
                   RRWW                                              HPL
Buffer size (MB)   # full events  # missed GPAs  % of missed GPAs    # full events  # missed GPAs  % of missed GPAs
512                17094          0              0                   20701          741            0.04
1024               8543           213            0.02                10510          116            0.01
Table 1. Number of missed GPAs during the treatment of buffer full events. We can see that the number of missed GPAs is
negligible, thus not impacting the WSS estimation algorithm.
5.3.3 Power consumption
Power consumption overhead due to PRL corresponds to
the power consumed by T^{Circuitry}_{PRL}. It is equal to the
power consumed by T^{Circuitry}_{PML}. Therefore, we evaluate
T^{Circuitry}_{PRL} using a PML-capable machine on which a read
intensive application runs. This experiment is the same as the
one presented in Section 3.2, see the noPML-0 and PML-0 curves
in Fig. 3. Recall that we observed almost no additional power
consumption due to PML, and thus to PRL.
6 Related work
This section presents the related work.
Hardware assisted virtualization (HAV) Contrary to A. Baumann's
conclusion in his HotOS'17 paper [24], we believe
that virtualization is the systems sub-domain which mainly
influences hardware architecture research. In fact, HAV
contributions have evolved at the rhythm of the limitations of
software solutions. Notably, Extended/Nested Page Table [25]
has been introduced for addressing the tremendous number of
context switches caused by shadow paging [60]. In summary,
several hardware features (e.g., Intel VT [8], AMD-V [4],
AMD SEV [3], VMFUNC [13], Intel CAT [9], APICv [50])
have been integrated inside CPUs, memory subsystems, I/O
devices and many other motherboard components by hardware
manufacturers in recent years for achieving basic
virtualization functionalities. An extreme application of the HAV
approach has been proposed by E. Keller with NoHype [40],
which is a hardware-only hypervisor. In 2017, Amazon
announced its new hypervisor called Nitro [43], which can
be seen as a concretization of the NoHype vision. In
academia, a lot of effort has been devoted to memory
virtualization [19, 20, 22, 34, 35, 44, 62, 63] to minimize
the overhead of the 2D page walk imposed by EPT.
[Figure 12 plots: (top) CPU (%) over Time, with one VM;
(bottom) CPU (%) as a function of the number of single-vCPU VMs.]
Figure 12. (top) Percentage of CPU time consumed by the
WSS estimation process inside the pVM during the execution
of a single-vCPU VM. (bottom) The average consumption
when varying the number of VMs.
Page tracking The most popular approach for page tracking
consists in denying access to memory pages which need to
be monitored so that next accesses trap inside the system
software (hypervisor or OS). This approach is used by the
majority of checkpointing, live migration and WSS estima-
tion solutions. Very few research works have investigated
hardware features for page access tracking.
Pin Zhou et al. [68] proposed a Miss Ratio Curve (MRC)
monitoring hardware feature which can be used as an
alternative to page access tracking in the task of WSS estimation [67].
[68] proposes a solution which consists in snooping the
address bus and requires a collaboration with the OS page fault
handler. Like PML/PRL, [68] showed that tracking page
accesses at the hardware level is possible. However, [68] is
dedicated to native systems and it needs to collaborate with
the OS, unlike PML/PRL.
Checkpointing and live migration Extensive research studies
have investigated these two topics. Concerning VM live
migration, [48, 58] presented surveys that the reader can
refer to. Among them, we would like to highlight [18, 51], who
studied live migration of the VM storage along with its memory.
Very few research works have investigated VM storage live
migration because it increases the VM downtime. Migrating
the VM storage is necessary when dealing with data intensive
applications because they generally use local storage
instead of the classical network storage. The utilization of
PRL/PML is also beneficial for this use case because disk
accesses go through the buffer cache, which resides in RAM.
[66] addressed another important aspect of live migration,
which is the prediction of the right migration instant. In fact,
live migration could fail due to a lack of resources or sudden
VM behavior changes. The authors use the VM WSS to track
such behavior changes. Thus PRL is likely to improve [66]'s
contribution.
Concerning VM checkpointing, we would like to highlight [64],
who focused on optimizing VM restore. This operation
has not been extensively studied for checkpointing
although it is of much importance. In fact, a quick restore
reduces the duration of service unavailability after VM failure
detection. The authors minimize disk accesses by optimizing
prefetching. This work is orthogonal to our contribution.
WSS estimation Committed_AS, a Linux kernel statistic, is
generally used (e.g., by Xen) to estimate the VM WSS. This
statistic corresponds to the total number of anonymous memory
pages allocated by all processes, but not necessarily backed
by physical pages. Therefore, Committed_AS over-estimates
the WSS. Another limitation of this approach is the fact that
it requires a collaboration between the hypervisor and the
guest OS. Zballoond [29] relies on the following observation:
when a VM's memory size is larger than or equal to
its WSS, the number of swap-in and refault (which occurs when
a previously evicted page is later accessed) events is close
to zero. The basic idea behind Zballoond consists in gradually
decreasing the VM's memory size until these counters
start to increase. The VM's WSS is the lowest memory size
which leads the VM to zero swap-in and refault events. Like
Committed_AS, Zballoond requires the collaboration of the
guest kernel. Furthermore, Zballoond is very active in the
sense that it puts memory pressure on the VM, which
could degrade the VM performance. Geiger [37] monitors
the evictions and subsequent reloads from the guest OS buffer
cache from/to the swap device. It relies on a ghost buffer [55]
which represents an imaginary memory buffer which extends
the VM's physical memory (noted m_cur). The size of this
buffer (noted m_ghost) represents the amount of extra memory
which would prevent the VM from swapping out. Knowing
the ghost buffer size, the VM's WSS can be computed using
the following formula: WSS = m_cur + m_ghost if m_ghost > 0.
Unlike the two previous solutions, Geiger is transparent from
the VM's point of view. However, Geiger has an important
drawback which derives from its non-intrusiveness. It is able
to estimate the WSS only when the size of the ghost buffer is
greater than zero (the VM is in a swapping state). Geiger is
inefficient if the VM's WSS is smaller than the current memory
allocation. Hypervisor Exclusive Cache [47] is fairly similar
to Geiger. Badis [52] combined VMware and Geiger in order
to take advantage of their non-intrusiveness with respect to the
VM's codebase. Badis suffers from the drawbacks of VMware and
Geiger presented above. [67] computes the WSS of an application
based on its miss-ratio curve (MRC). The latter shows the
fraction of the cache misses that would turn into cache hits
if the VM's allocated memory increased. [67] presents a set
of methods to determine the values of the input parameters
of our WSS estimation system (τ, ω, and µ). [42] presents an
application-assisted WSS estimation solution in virtualized
systems. In contrast to our solution, which considers the VM
as a black box, [42] relies on the application inside the VM
to estimate the WSS, which is very intrusive.
7 Conclusion
This paper presents a thorough analysis of Page Modification
Logging (PML), a memory page access tracking technology
introduced by Intel and VMware as a key virtualization
functionality. We show that although the current design of PML
makes it effective for VM live migration and checkpointing,
it is not appropriate for working set size (WSS) estimation. In
the light of our analysis, we propose Page Reference Logging
(PRL), an extension to PML which makes it also effective for WSS
estimation. We implemented PRL in Gem5, a popular hardware
simulator, and described a WSS estimation system which
leverages PRL. We evaluated our solution using both real and
synthetic applications, and compared it with VMware's WSS
estimation solution. Our results demonstrate that, unlike
VMware's, our solution is accurate and does not impact
user VMs.
References
[1] [n. d.]. ([n. d.]). http://www.linux-kvm.org Visited on December
2018.
[2] [n. d.]. ([n. d.]). http://gem5.org/Publications Visited on July 2019.
[3] [n. d.]. AMD Secure Encrypted Virtualization (SEV). ([n. d.]). https:
//developer.amd.com/sev/ Visited on April 2019.
[4] [n. d.]. AMD-V™ Technology for Client Virtualization. ([n. d.]).
https://www.amd.com/en/technologies/virtualization Visited on
April 2019.
[5] [n. d.]. ’Dirty Page Logs’ coming to future vSphere re-
lease. https://www.theregister.co.uk/2016/05/11/dirty_page_logs_
coming_to_future_vsphere_release/. ([n. d.]).
[6] [n. d.]. Hyper-V Architecture. ([n. d.]). http://msdn.microsoft.com/
en-us/library/cc768520.aspx. Visited on December 2018.
[7] [n. d.]. Improving real-time performance by utilizing cache
allocation technology. White paper.
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf.
([n. d.]). Visited on July 2019.
[8] [n. d.]. Intel® Virtualization Technology (Intel® VT). ([n.
d.]). https://www.intel.com/content/www/us/en/virtualization/
virtualization-technology/intel-virtualization-technology.html Vis-
ited on April 2019.
[9] [n. d.]. Introduction to Cache Allocation Technology in the Intel®
Xeon® Processor E5 v4 Family. ([n. d.]). https://software.intel.com/
en-us/articles/introduction-to-cache-allocation-technology Vis-
ited on April 2019.
[10] [n. d.]. Memory Overcommit. https://www.techopedia.com/
definition/14761/memory-overcommi. ([n. d.]). Visited on July
2019.
[11] [n. d.]. Page-Modification Logging for Virtual-Machine Mon-
itor. https://www.intel.com/content/www/us/en/processors/
page-modification-logging-vmm-white-paper.html. ([n. d.]).
[12] [n. d.]. PARSEC. ([n. d.]). http://parsec.cs.princeton.edu/ Visited
on April 2019.
[13] [n. d.]. VMFUNC - Invoke VM function. ([n. d.]). https://www.
felixcloutier.com/x86/vmfunc Visited on April 2019.
[14] [n. d.]. VMKernel Architecture. ([n. d.]). https://communities-gbot.
vmware.com/thread/93870 Visited on December 2018.
[15] [n. d.]. VMware VMkernel. ([n. d.]). https://microage.com/
wp-content/uploads/2016/02/ESXi_architecture.pdf. Visited on
December 2018.
[16] 2019. A Scalable Big Data and AI Benchmark Suite, ICT, Chinese
Academy of Sciences. http://prof.ict.ac.cn/BigDataBench/. (March
2019).
[17] 2019. HPL - A Portable Implementation of the High-Performance
Linpack Benchmark for Distributed-Memory Computers. http://www.
netlib.org/benchmark/hpl/. (March 2019).
[18] Abhinit Modi, Raghavendra Achar, and P. Santhi Thilagam. 2017. Live
Migration of Virtual Machines with Their Local Persistent Storage in
a Data Intensive Cloud. Int. J. High Perform. Comput. Netw. 10, 1-2
(Jan. 2017), 134–147. http://dl.acm.org/citation.cfm?id=3070823.3070837
[19] Jeongseob Ahn, Seongwook Jin, and Jaehyuk Huh. 2012. Revisiting
Hardware-assisted Page Walks for Virtualized Systems. In Proceedings
of the 39th Annual International Symposium on Computer Architecture
(ISCA ’12). IEEE Computer Society, Washington, DC, USA, 476–487.
http://dl.acm.org/citation.cfm?id=2337159.2337214
[20] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. 2017.
Do-It-Yourself Virtual Memory Translation. In Proceedings of the
44th Annual International Symposium on Computer Architecture (ISCA
’17). ACM, New York, NY, USA, 457–468. https://doi.org/10.1145/
3079856.3080209
[21] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen
and the art of virtualization. In SOSP. 164–177.
[22] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010. Translation
Caching: Skip, Don't Walk (the Page Table). In Proceedings of the
37th Annual International Symposium on Computer Architecture (ISCA
'10). ACM, New York, NY, USA, 48–59.
https://doi.org/10.1145/1815961.1815970
[23] Luiz Andre Barroso, Urs Hoelzle, and Parthasarathy Ranganathan.
2018. The Datacenter as a Computer: Designing Warehouse-Scale
Machines (3rd ed.). Morgan and Claypool Publishers.
[24] Andrew Baumann. 2017. Hardware is the New Software. In Pro-
ceedings of the 16th Workshop on Hot Topics in Operating Sys-
tems (HotOS ’17). ACM, New York, NY, USA, 132–137. https:
//doi.org/10.1145/3102980.3103002
[25] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha
Manne. 2008. Accelerating Two-dimensional Page Walks for Vir-
tualized Systems. In Proceedings of the 13th International Confer-
ence on Architectural Support for Programming Languages and Op-
erating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35.
https://doi.org/10.1145/1346281.1346286
[26] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Rein-
hardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower,
Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell,
Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood.
2011. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 2
(Aug. 2011), 1–7. https://doi.org/10.1145/2024716.2024718
[27] Marc Brooker and Holly Mesrobian. 2018. A Serverless Journey:
Under the Hood of AWS Lambda. https://www.youtube.com/watch?
v=QdzV04T_kec. (Nov. 2018).
[28] Jui-Hao Chiang, Han-Lin Li, and Tzi cker Chiueh. 2013. Working
Set-based Physical Memory Ballooning. In Proceedings of the 10th In-
ternational Conference on Autonomic Computing (ICAC 13). USENIX,
San Jose, CA, 95–99. https://www.usenix.org/conference/icac13/
technical-sessions/presentation/chiang
[29] Jui-Hao Chiang, Han-Lin Li, and Tzi cker Chiueh. 2013. Working
Set-based Physical Memory Ballooning. In Proceedings of the 10th In-
ternational Conference on Autonomic Computing (ICAC 13). USENIX,
San Jose, CA, 95–99. https://www.usenix.org/conference/icac13/
technical-sessions/presentation/chiang
[30] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric
Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. 2005. Live
Migration of Virtual Machines. In Proceedings of the 2nd Conference
on Symposium on Networked Systems Design & Implementation - Vol-
ume 2 (NSDI’05). USENIX Association, Berkeley, CA, USA, 273–286.
http://dl.acm.org/citation.cfm?id=1251203.1251223
[31] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Mar-
cus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Under-
standing and Predicting Workloads for Improved Resource Manage-
ment in Large Cloud Platforms. In Proceedings of the 26th Symposium
on Operating Systems Principles (SOSP ’17). ACM, New York, NY,
USA, 153–167. https://doi.org/10.1145/3132747.3132772
[32] Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-
efficient and QoS-aware Cluster Management. In Proceedings of
the 19th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOS ’14). ACM,
New York, NY, USA, 127–144. https://doi.org/10.1145/2541940.
2541941
[33] Peter J. Denning. 1968. The Working Set Model for Program Behavior.
Commun. ACM 11, 5 (May 1968), 323–333. https://doi.org/10.1145/
363095.363141
[34] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift.
2014. Efficient Memory Virtualization: Reducing Dimensionality of
Nested Page Walks. In Proceedings of the 47th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO-47). IEEE
Computer Society, Washington, DC, USA, 178–189. https://doi.org/
10.1109/MICRO.2014.37
[35] Jayneel Gandhi, Mark D. Hill, and Michael M. Swift. 2016. Ag-
ile Paging: Exceeding the Best of Nested and Shadow Paging. In
Proceedings of the 43rd International Symposium on Computer Ar-
chitecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 707–718.
https://doi.org/10.1109/ISCA.2016.67
[36] Stephen T. Jones, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-
Dusseau. 2006. Geiger: Monitoring the Buffer Cache in a Virtual
Machine Environment. In Proceedings of the 12th International Con-
ference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS XII). ACM, New York, NY, USA, 14–24.
https://doi.org/10.1145/1168857.1168861
[37] Stephen T. Jones, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-
Dusseau. 2006. Geiger: Monitoring the Buffer Cache in a Virtual
Machine Environment. SIGARCH Comput. Archit. News 34, 5 (Oct.
2006), 14–24. https://doi.org/10.1145/1168919.1168861
[38] Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur
Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov,
Íñigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao.
2016. Morpheus: Towards Automated SLOs for Enterprise Clusters. In
Proceedings of the 12th USENIX Conference on Operating Systems De-
sign and Implementation (OSDI’16). USENIX Association, Berkeley,
CA, USA, 117–134. http://dl.acm.org/citation.cfm?id=3026877.
3026887
[39] Harshad Kasture and Daniel Sánchez. 2016. Tailbench: a benchmark
suite and evaluation methodology for latency-critical applications. In
2016 IEEE International Symposium on Workload Characterization,
IISWC 2016, Providence, RI, USA, September 25-27, 2016. 3–12.
https://doi.org/10.1109/IISWC.2016.7581261
[40] Eric Keller, Jakub Szefer, Jennifer Rexford, and Ruby B. Lee. 2010.
NoHype: Virtualized Cloud Infrastructure Without the Virtualization.
SIGARCH Comput. Archit. News 38, 3 (June 2010), 350–361. https:
//doi.org/10.1145/1816038.1816010
[41] Jinchun Kim, Viacheslav Fedorov, Paul V. Gratz, and A. L. Narasimha
Reddy. 2015. Dynamic Memory Pressure Aware Ballooning. In
Proceedings of the 2015 International Symposium on Memory Sys-
tems (MEMSYS ’15). ACM, New York, NY, USA, 103–112. https:
//doi.org/10.1145/2818950.2818967
[42] Min Lee, A. S. Krishnakumar, P. Krishnan, Navjot Singh, and Shalini
Yajnik. 2011. Hypervisor-assisted Application Checkpointing in
Virtualized Environments. In Proceedings of the 2011 IEEE/IFIP
41st International Conference on Dependable Systems & Networks
(DSN ’11). IEEE Computer Society, Washington, DC, USA, 371–382.
https://doi.org/10.1109/DSN.2011.5958250
[43] Anthony Liguori. 2018. Powering Next-Gen EC2 Instances – Deep
Dive into the Nitro System. https://www.youtube.com/watch?v=
e8DVmwj3OEs. (Nov. 2018).
[44] Jin Tack Lim, Christoffer Dall, Shih-Wei Li, Jason Nieh, and Marc
Zyngier. 2017. NEVE: Nested Virtualization Extensions for ARM. In
Proceedings of the 26th Symposium on Operating Systems Principles
(SOSP’17), Shanghai, China, October 28-31, 2017. 201–217.
[45] Kevin T. Lim, Jichuan Chang, Trevor N. Mudge, Parthasarathy Ran-
ganathan, Steven K. Reinhardt, and Thomas F. Wenisch. 2009. Dis-
aggregated Memory for Expansion and Sharing in Blade Servers. In
Proceedings of the 2009 ACM/IEEE Annual International Symposium
on Computer Architecture (ISCA’09). ACM, 267–278.
[46] Pin Lu and Kai Shen. 2007. Virtual Machine Memory Access Tracing
with Hypervisor Exclusive Cache. In 2007 USENIX Annual Technical
Conference on Proceedings of the USENIX Annual Technical Confer-
ence (ATC’07). USENIX Association, Berkeley, CA, USA, Article 3,
15 pages. http://dl.acm.org/citation.cfm?id=1364385.1364388
[47] Pin Lu and Kai Shen. 2007. Virtual Machine Memory Access Tracing
with Hypervisor Exclusive Cache. In 2007 USENIX Annual Technical
Conference on Proceedings of the USENIX Annual Technical Confer-
ence (ATC’07). USENIX Association, Berkeley, CA, USA, Article 3,
15 pages. http://dl.acm.org/citation.cfm?id=1364385.1364388
[48] Violeta Medina and Juan Manuel García. 2014. A Survey of Migration
Mechanisms of Virtual Machines. ACM Comput. Surv. 46, 3, Article
30 (Jan. 2014), 33 pages. https://doi.org/10.1145/2492705
[49] Jeffrey C. Mogul, Andrew Baumann, Timothy Roscoe, and Livio
Soares. 2011. Mind the gap: Reconnecting architecture and OS re-
search. In Proceedings of the 13th Workshop on Hot Topics in Operat-
ing Systems (HotOS).
[50] Khang T (Intel) Nguyen. [n. d.]. APIC Virtual-
ization Performance Testing and Iozone. ([n. d.]).
https://software.intel.com/en-us/blogs/2013/12/17/
apic-virtualization-performance-testing-and-iozone Visited on
December 2018.
[51] Bogdan Nicolae and Franck Cappello. 2012. A Hybrid Local Storage
Transfer Scheme for Live Migration of I/O Intensive Workloads. In
Proceedings of the 21st International Symposium on High-Performance
Parallel and Distributed Computing (HPDC ’12). ACM, New York,
NY, USA, 85–96. https://doi.org/10.1145/2287076.2287088
[52] Vlad Nitu, Aram Kocharyan, Hannas Yaya, Alain Tchana, Daniel
Hagimont, and Hrachya Astsatryan. 2018. Working Set Size Estimation
Techniques in Virtualized Environments: One Size Does Not Fit All.
Proc. ACM Meas. Anal. Comput. Syst. 2, 1, Article 19 (April 2018),
22 pages. https://doi.org/10.1145/3179422
[53] Vlad Nitu, Pierre Olivier, Alain Tchana, Daniel Chiba, Antonio Bar-
balace, Daniel Hagimont, and Binoy Ravindran. 2017. Swift Birth
and Quick Death: Enabling Fast Parallel Guest Boot and Destruc-
tion in the Xen Hypervisor. In Proceedings of the 13th ACM SIG-
PLAN/SIGOPS International Conference on Virtual Execution En-
vironments (VEE ’17). ACM, New York, NY, USA, 1–14. https:
//doi.org/10.1145/3050748.3050758
[54] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel Hagi-
mont. 2018. Welcome to Zombieland: Practical and Energy-efficient
Memory Disaggregation in a Datacenter. In Proceedings of the Thir-
teenth EuroSys Conference (EuroSys ’18). ACM, New York, NY, USA,
Article 16, 12 pages. https://doi.org/10.1145/3190508.3190537
[55] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka.
1995. Informed Prefetching and Caching. SIGOPS Oper. Syst. Rev. 29,
5 (Dec. 1995), 79–95. https://doi.org/10.1145/224057.224064
[56] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018.
LegoOS: A Disseminated, Distributed OS for Hardware Resource
Disaggregation. In Proceedings of the 12th USENIX Conference on
Operating Systems Design and Implementation (OSDI’18). USENIX
Association, Carlsbad, CA, USA, 69–87.
[57] Bharathi Subramanian. [n. d.]. NAPI - The New API for Linux Network
Drivers. ([n. d.]). https://bharathisubramanian.wordpress.com/
2010/04/13/napi-the-new-api-for-linux-network-drivers/ Visited
on December 2018.
[58] Petter Svärd, Benoit Hudzia, Steve Walsh, Johan Tordsson, and Erik
Elmroth. 2015. Principles and Performance Characteristics of Algo-
rithms for Live VM Migration. SIGOPS Oper. Syst. Rev. 49, 1 (Jan.
2015), 142–155. https://doi.org/10.1145/2723872.2723894
[59] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L. Santoni, Fernando
C. M. Martins, Andrew V. Anderson, Steven M. Bennett, Alain Kagi,
Felix H. Leung, and Larry Smith. 2005. Intel Virtualization Technology.
Computer 38, 5 (May 2005), 48–56. https://doi.org/10.1109/MC.
2005.163
[60] Carl A. Waldspurger. 2002. Memory Resource Management in
VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002),
181–194. https://doi.org/10.1145/844128.844146
[61] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and
Michael Swift. 2018. Peeking Behind the Curtains of Serverless Plat-
forms. In 2018 USENIX Annual Technical Conference (USENIX ATC
18). USENIX Association, Boston, MA, 133–146.
[62] Xiaolin Wang, Jiarui Zang, Zhenlin Wang, Yingwei Luo, and Xiaoming
Li. 2011. Selective Hardware/Software Memory Virtualization. In Pro-
ceedings of the 7th ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments (VEE ’11). ACM, New York, NY,
USA, 217–226. https://doi.org/10.1145/1952682.1952710
[63] Idan Yaniv and Dan Tsafrir. 2016. Hash, Don’t Cache (the Page
Table). In Proceedings of the 2016 ACM SIGMETRICS Interna-
tional Conference on Measurement and Modeling of Computer Sci-
ence (SIGMETRICS ’16). ACM, New York, NY, USA, 337–350.
https://doi.org/10.1145/2896377.2901456
[64] Irene Zhang, Tyler Denniston, Yury Baskakov, and Alex Garthwaite.
2013. Optimizing VM Checkpointing for Restore Performance in
VMware ESXi. In Proceedings of the 2013 USENIX Conference on
Annual Technical Conference (USENIX ATC’13). USENIX Associa-
tion, Berkeley, CA, USA, 1–12. http://dl.acm.org/citation.cfm?id=
2535461.2535463
[65] Irene Zhang, Alex Garthwaite, Yury Baskakov, and Kenneth C. Barr.
2011. Fast Restore of Checkpointed Memory Using Working Set
Estimation. In Proceedings of the 7th ACM SIGPLAN/SIGOPS Interna-
tional Conference on Virtual Execution Environments (VEE ’11). ACM,
New York, NY, USA, 87–98. https://doi.org/10.1145/1952682.
1952695
[66] Jinshi Zhang, Eddie Dong, Jian Li, and Haibing Guan. 2017. MigVisor:
Accurate Prediction of VM Live Migration Behavior Using a Working-
Set Pattern Model. SIGPLAN Not. 52, 7 (April 2017), 30–43. https:
//doi.org/10.1145/3140607.3050753
[67] Weiming Zhao, Xinxin Jin, Zhenlin Wang, Xiaolin Wang, Yingwei Luo,
and Xiaoming Li. 2011. Low Cost Working Set Size Tracking. In Proceedings
of the 2011 USENIX Conference on USENIX Annual Technical Conference
(USENIX ATC ’11). USENIX Association, Berkeley, CA,
USA, 17–17. http://dl.acm.org/citation.cfm?id=2002181.2002198
[68] Pin Zhou, Vivek Pandey, Jagadeesan Sundaresan, Anand Raghuraman,
Yuanyuan Zhou, and Sanjeev Kumar. 2004. Dynamic Tracking of
Page Miss Ratio Curve for Memory Management. SIGOPS Oper. Syst.
Rev. 38, 5 (Oct. 2004), 177–188. https://doi.org/10.1145/1037949.
1024415
