Memory virtualization in virtualized systems: segmentation is better
  than paging by Teabe, Boris et al.
arXiv:2006.00380v1 [cs.OS] 30 May 2020
Memory virtualization in virtualized systems:
segmentation is better than paging
Boris TEABE
Université de Toulouse
Peterson YUHALA
Université de Neuchâtel
Alain TCHANA
ENS Lyon
Fabien HERMENIER
Nutanix
Daniel HAGIMONT
Université de Toulouse
Gilles MULLER
Inria
Abstract
The utilization of paging for virtual machine (VM) mem-
ory management is the root cause of memory virtualization
overhead. This paper shows that paging is not necessary in
the hypervisor. In fact, memory fragmentation, which ex-
plains paging utilization, is not an issue in virtualized data-
centers thanks to VM memory demand patterns. Our solu-
tion Compromis, a novel Memory Management Unit, uses
direct segment for VM memory management combined with
paging for VM’s processes. The paper presents a system-
atic methodology for implementing Compromis in the hard-
ware, the hypervisor and the datacenter scheduler. Evalua-
tion results show that Compromis outperforms the two popu-
lar memory virtualization solutions: shadow paging and Ex-
tended Page Table by up to 30% and 370% respectively.
ACM Reference format:
Boris TEABE, Peterson YUHALA, Alain TCHANA, Fabien HER-
MENIER, Daniel HAGIMONT, and Gilles MULLER. 2016. Mem-
ory virtualization in virtualized systems: segmentation is better than
paging. In Proceedings of ACM Conference, Washington, DC, USA,
July 2017 (Conference’17), 14 pages.
1 Introduction
Virtualization has become the de facto cloud computing stan-
dard because it brings several benefits such as optimal server
utilization, security, fault tolerance and quick service deploy-
ment [1, 3, 10]. However, there is still room for improve-
ment, mainly at the memory level which represents up to
90% [45] of the global virtualization overhead.
Memory virtualization overhead comes from the necessity
to manage three address spaces (application, guest OS and
host OS) instead of two (application and OS) as in native
systems. Shadow paging [43] is the most popular memory
virtualization solution. Each page table inside the guest OS
is shadowed by a corresponding page table inside the hyper-
visor, which contains the real mapping between Guest Vir-
tual Addresses (GVA) and Host Physical Addresses (HPA).
Thus, shadow page tables are those used for address transla-
tion by the hardware page table walker (which resides inside
the Memory Management Unit, MMU). Page tables inside
the guest OS are never used.
Shadow paging leads to a one-dimensional (1D) page walk
on a TLB miss, as in a native system. However, building shadow
page tables comes with costly context switches between the
guest OS and the hypervisor for synchronization. Nested/Extended
Page Table (EPT) [15, 42] has been introduced for avoiding
page table synchronization cost. It improves the page table
walker to walk through two page tables (from the guest and
from the hypervisor) at the same time in a 2D manner. Thus,
building the shadow page table does not require the protec-
tion of guest’s page tables. This drastically reduces the num-
ber of context switches. However, this solution induces sev-
eral memory accesses during address translation due to the
2D page walk mechanism. In a radix-4 page table [40] (the
most popular case) for instance, this 2D page walk leads to
24 memory accesses on each TLB miss instead of 4 in a na-
tive system, resulting in significant performance degradation.
While many works have proved the effectiveness of paging when
dealing with processes (e.g., for reducing memory fragmen-
tation), to our knowledge, there is no clear assessment of
its effectiveness when dealing with virtual machines (VMs).
One explanation is that the implementation of hypervisors
was inspired by the bare metal implementation of OSes. By
analyzing traces from two public clouds (Microsoft Azure
and Bitbrain) and 308 private clouds (managed by Nutanix; see footnote 1),
we show in Section 4 that paging is not mandatory for manag-
ing memory allocated to VMs. In fact, we found that mem-
ory fragmentation is not an issue in virtualized datacenters
thanks to VM memory sizing and arrival rate.
This paper presents Compromis, a novel MMU solution
which uses segmentation for VMs and paging for processes
inside VMs. Compromis allows a 1D page walk and gen-
erates zero context switch to virtualize memory. Compro-
mis is inspired by Direct Segment (DS) introduced by Basu
et al. [13] for native systems. For memory hungry applica-
tions which allocate at start time their entire memory and
self-manage it at runtime (e.g., Java Virtual Machine), their
virtual memory space can be directly mapped to a large physical
memory segment identified by a triple (Base, Limit, Offset).
This way, the translation of a virtual address va is given
by a simple register to register addition (va+Offset). Com-
promis generalizes DS to DS-n, allowing the provisioning of
a VM with several memory segments. In Compromis, every
1 Nutanix is a worldwide private cloud provider.
processor context includes 2n new registers for address trans-
lation. Contrary to other DS based solutions [9, 24], Compro-
mis considers the entire VM memory and requires no guest
OS and application modification.
This paper also investigates systems implications and presents
a systematic methodology for adapting the hypervisor and
other cloud services (e.g., datacenter scheduler) for making
Compromis effective. To the best of our knowledge, this is
the first DS based approach in virtualized systems which puts
the entire puzzle together.
We have implemented a complete prototype in both Xen and
KVM virtualized systems managed by OpenStack. This involves,
first, the integration of our DS aware memory allocation algorithm;
second, the extension of the Virtual Machine Control Structure
(VMCS) to configure the new hardware registers introduced by DS-n;
and finally, the improvement of OpenStack's VM placement algorithm
to minimize the number of memory segments. For evaluations, since our
solution relies on the modification of the hardware, we mimicked
the functioning of a VM which runs on a DS-n machine
as follows. We run the VM in para-virtualized (PV) mode
[11] because the latter uses a 1D page walk, as DS-n does. However,
in PV, all page table modifications performed by the VM ker-
nel trap into the hypervisor using hypercalls. To avoid this
behavior which will not exist on a DS-n machine, we have
modified the guest kernel to directly set page table entries
with the correct HPAs, calculated in the same way as a DS-n
hardware would have done.
To be exhaustive, the paper makes the following contribu-
tions:
• We first study the potential effectiveness of DS in virtualized
datacenters. In other words, we answer this
question: regarding VM memory demands, arrival
times and departure times, is it likely to provision all
or the majority of the VMs with one large memory
segment? To answer this question, we study mem-
ory fragmentation in virtualized systems by analyz-
ing traces from two public clouds (Microsoft Azure
[21] and Bitbrain [39]) and 308 private clouds. We
found that, using a DS aware memory allocation system,
memory fragmentation is not as critical an issue in
virtualized datacenters as in native ones.
• Drawing on this conviction, we propose DS-n, a gen-
eralization of DS to provision a VM with multiple
memory segments. We present the necessary hard-
ware modifications required by DS-n.
• We propose a DS aware VM memory allocation algo-
rithm which minimizes the number of memory seg-
ments to use.
• We evaluated the performance gain of DS-n using an
accurate methodology on a real environment. The
main results are as follows. First, the analyzed datacenter
traces show that it is possible to provision
up to 99.99% of the VMs with one memory segment,
while three segments are sufficient to provision
all the VMs. Second, concerning the performance
gain, DS-n reduces memory virtualization overhead
to only 0.35%, outperforming both shadow paging
and EPT by up to 30% and 370% respectively. The re-
sults also show that our memory allocation algorithm
runs faster than traditional ones. Xen’s algorithm is
outperformed by 80%.
The remainder of the paper is as follows. Section 2 presents
the necessary background to understand our contributions.
Section 3 evaluates the limitations of state-of-the-art solu-
tions. Section 4 presents the analysis of several production
datacenter traces and validates the opportunity to apply DS in
virtualized systems. Section 5 presents the necessary hard-
ware and software improvements to make DS-n effective.
Section 6 presents the evaluation results. Section 7 presents
the related work. Section 8 concludes the paper.
2 Background
This section presents the two main techniques used to achieve
memory virtualization, namely Shadow paging [43] and Ex-
tended Page Table (noted EPT) [15, 42].
2.1 Shadow paging
Shadow paging is a software memory virtualization technique.
In Shadow paging the hypervisor creates a shadow Page Ta-
ble (PT) for each guest PT. This shadow PT holds the com-
plete translation from GVA to HPA. It is walked by the Hard-
ware Page Table Walker (HPTW) on Translation Lookaside
Buffer (TLB) miss. Guest Page Tables (GPTs) are fake ones:
they are never used for address translation. To put in place shadow paging, the
hypervisor write protects both the CR3 register (which holds
the current PT address) and GPTs. Each time the guest OS
attempts to modify these structures, it traps into the hypervisor,
which fixes the CR3 register or the shadow PT. Using
shadow paging, the HPTW only performs a 1D page walk as
in a native system, leading to 4 memory accesses. However,
the resulting context switches severely degrade VM performance.
2.2 Extended Page Table
EPT (also called Nested Page Table) is a hardware-assisted
memory virtualization solution proposed by many chip ven-
dors such as Intel and AMD. It relies on a two layer PT. The
first PT layer resides in the guest address space and is ex-
clusively managed by its OS, at the rate of one PT per pro-
cess. This first layer PT thus contains GPAs which point to
guest pages in the guest address space. Every process context
switch triggers the setting by the guest OS of the CR3
register with the GPA of the scheduled-in process's PT.
The second PT layer resides in the hypervisor, at the
rate of one PT per VM. This PT represents the address space
of the guest and contains HPAs which point to pages (real
pages in RAM) in the host address space. Every vCPU context
switch triggers the setting by the hypervisor of the nested
CR3 register (nCR3) with the HPA of the scheduled-in VM's
PT. On a TLB miss, the hardware page table walker
translates a virtual address va into the corresponding HPA
by performing a 2D page walk, leading to 24 memory
accesses.

Benchmark        Description
SpecCPU 2006     Compute multi-threaded workloads
PARSEC 3.0       Compute multi-threaded workloads
Redis            In-memory database
Elastic search   In-memory database
Table 1. Benchmarks used for assessment and evaluation of our solution.

Configuration   Estimated overhead
Native          total cycles of all page walks
EPT             total cycles of all (GPT + EPT) walks
Shadow          total cycles of all (hypervisor level PT walk + VMEntry + VMExit + handler)
Table 2. Formulas to estimate the overhead of memory virtualization. ("handler" is the handler which treats the VMExit generated when the guest OS attempts to modify the page table.)
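As an illustration, the access counts quoted for 1D and 2D page walks can be reproduced with a short calculation. The sketch below assumes a radix-4 page table at both the guest and the hypervisor level; the function name and its parameters are ours and only serve the example. In a 2D walk, each guest PT level is located by a GPA that must itself be translated through the nested PT before the guest entry can be read, and the GPA produced at the end of the guest walk requires one more nested walk.

```python
def page_walk_accesses(guest_levels: int = 4, nested_levels: int = 4) -> dict:
    """Count memory accesses per TLB miss (illustrative helper, not from the paper)."""
    # Native or shadow paging: one access per page table level.
    one_d = guest_levels
    # Nested paging: reaching each guest level costs a full nested walk
    # plus the read of the guest entry itself.
    per_guest_level = nested_levels + 1
    # The GPA produced by the guest walk needs a last nested walk.
    two_d = guest_levels * per_guest_level + nested_levels
    return {"1D": one_d, "2D": two_d}


print(page_walk_accesses())  # {'1D': 4, '2D': 24}
```

The 2D count is equivalent to (guest_levels + 1) × (nested_levels + 1) − 1, which gives 24 for four levels on each side.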
3 Assessment
This section presents the overhead of memory virtualization
in both native and virtualized systems. Note that even in
a native system, the expression "memory virtualization" is
used because of the mapping of the process linear address
space to the physical address space.
Methodology. Table 1 lists the benchmarks used to evalu-
ate the performance overhead of memory virtualization. We
do this while varying the memory page size in both the hy-
pervisor and the guest OS. This way, we also evaluated huge
page-based solutions. The notation gX-hY means X and Y
are respectively the memory page size in the guest OS and
the hypervisor. The evaluation metric is the time taken by
both the hardware and the software for memory virtualiza-
tion. Table 2 presents how the performance metric is cal-
culated for each virtualization technology. We rely on both
Performance Monitoring Counters (PMC) and low-level software
probes that we have written for the purpose of this paper.
Details on the experimental setup are given in Section 6.2.1.
Results. Figure 1 presents the results, interpreted as fol-
lows. First, even in native systems memory virtualization
takes a significant time proportion in the execution of an
application, up to 42% for mcf. Second, running applica-
tions in a virtualized environment increases that duration, up
to 50.93% for Elastic Search under shadow paging. Third,
shadow paging incurs more overhead for the majority of ap-
plications than EPT, up to 43.89% of difference for vips. Fi-
nally, we can observe that even when huge pages are used
simultaneously in the guest OS and the hypervisor, memory
virtualization overhead is still high, almost 31.5% for the Redis
benchmark. Prior studies [9, 13, 24, 35] have reached the same conclusion
with the use of huge pages.
Synthesis. These results show that the overhead of mem-
ory virtualization is very significant in a virtualized system
even with huge pages. The root cause of this overhead is the
utilization of paging as the memory virtualization basis for
VMs.
4 Paging is not a fatality for VMs
Several research works have tried to reduce the overhead of memory
virtualization in virtualized systems. However, no work
has questioned the relevance of paging in this context. This
section studies the (ir)relevance of paging when dealing with
VMs. To this end, we compare paging with segmentation,
which is the alternative approach that has been left out.
Paging consists in organizing both the virtual address space
of a process and the physical address space into fixed size
memory chunks (4KB, 2MB, etc.) called pages. Thus, each
virtual page can be housed in any physical page frame. The
process PT and the HPTW make it possible to find the actual
mapping of a virtual page to a physical page address. Seg-
mentation on the other hand organizes both the virtual and
the physical address spaces in the form of variable size mem-
ory chunks called segments. The size of the latter is chosen
by the programmer. The correspondence between virtual seg-
ments and physical segments is provided by a segment table.
The virtual address to physical address mapping is done by a
simple addition.
The main reasons which promote paging over segmentation
in native systems are as follows: (R1) Paging is invisible
to the programmer; this is not the case with segmentation,
which complicates application programming. (R2) Paging makes
the implementation of the memory allocator in the OS easier.
Indeed, it only requires a list of free pages,
and choosing any page within this list upon receiving
a memory allocation request is sufficient. This is important
for scalability. With segmentation, there is a need to find an
appropriate physical segment that satisfies the size of the virtual
segment requested by the application. (R3) Paging limits
memory fragmentation (see footnote 2), which is not the case with segmentation.
(R4) Paging allows overcommitment, which is useful
for optimizing memory utilization.
The question is whether all these reasons are valid when
manipulating VMs. To answer this question, we analyze the
2 Internal fragmentation within pages is always possible, but it is negligible
compared to the external fragmentation caused by segmentation.
[Figure 1: bar chart. y-axis: Virt. overhead (%). x-axis: benchmarks (bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk, blackscholes, bodytrack, canneal, ferret, fluidanimate, streamcluster, vips, Redis, Elastic Search). Series: Native 4KB, SP g4KB, EPT g4KB-h2MB, EPT g2MB-h2MB.]
Figure 1. Proportion of CPU time used for memory virtualization in native, virtualized shadow paging (SP) and virtualized EPT.
relevance of each reason in virtualized systems. Before do-
ing this, note that when we talk about memory management
in a virtualized system, we are talking about the allocation
of physical memory to VMs and not memory allocation to
applications inside VMs.
Relevance of R1 (Segmentation complicates application programming).
This reason is valid in native environments
(when dealing with applications) because application programmers
do not have the expertise to manage segment sizes
in a segmentation based system. Moreover, it is not at the heart
of the business logic of their application. When dealing with
VMs, developers are OS developers, who are experts. Leaving
OS developers the responsibility to manage memory segments
is within their reach.
Relevance of R2 (Paging makes memory allocation easier).
It is necessary to facilitate the work of the memory
allocator for scalability purposes. In a native system, the
memory allocator is subject to thousands of memory allocation
and deallocation requests per second. This is not true
when dealing with VMs. Each VM performs only one allocation
(at startup) and one deallocation (at shutdown). Thus, the
frequency of memory allocation and deallocation requests
received by the hypervisor is not of the same order of magnitude
as that received by the OS in a native system. Table 3
presents the average memory allocation frequency observed
on a server in a native system and in virtualized private and
public clouds (see Section 6 for more details on the analyzed
datasets). We observe a difference of several orders of magnitude
between native and virtualized systems, the latter being quite stable. Given
the extremely low values for virtualized datacenters, the difficulty
of finding free memory chunks with segmentation
does not matter when dealing with VMs.
Relevance of R3 (Paging limits fragmentation). Frag-
mentation is due to the heterogeneity of memory demand
sizes. Indeed, a system in which all demand sizes are identi-
cal would not suffer from fragmentation. To verify whether
fragmentation could be a problem in a virtualized datacenter,
we analyzed the memory demand sizes of the traces from the
datacenters presented in Table 3. Figure 2 shows the CDFs.
Dataset                                      Alloc./Hour/Server
Native - Our lab machine                     82071.5
Virtualized - Private clouds                 0.056
Virtualized - Microsoft Azure public cloud   0.31
Table 3. Memory allocation frequency (per hour on a server) in native and virtualized datacenters.
We observe that public clouds stand out with very concentrated
demand sizes (14 distinct sizes). This is because in public clouds
VM sizes are imposed. Things are slightly different in private
clouds (201 distinct sizes) where there is more freedom in VM size
definition. In contrast, demand sizes vary much more in native
systems (25k distinct sizes) than in virtualized environments. These results
show that fragmentation is not a relevant issue when dealing
with VMs.
Relevance of R4 (Paging allows memory overcommitment).
Overcommitment is a practice which consists in reserving
more memory than the physical machine actually has.
It exploits the fact that applications do not all require their
entire memory demand at the same time. As a result, resource
waste is avoided. However, overcommitment comes
with performance degradation (during memory reclamation)
and performance unpredictability [31]. These limitations are
acceptable in a native system because there is no contract
between application owners and the datacenter owner; they
both belong to the same company. Best effort is the practice
in such contexts. Things are different in a virtualized
datacenter, especially in commercial clouds. In the latter,
the datacenter operator should respect the contract signed
with the VM owner, who paid for the reserved resources.
Therefore, even if a VM is not using its resources, these resources
have already been amortized. The necessity to avoid
resource waste is thus less critical than in a native system.
Furthermore, the implementation of overcommitment in a virtualized
system is challenging because of the black-box
nature of VMs [31]. It requires expertise in the workload and the
system to configure it and react in case of performance issues.
[Figure 2: three CDF plots. x-axis: Memory sizes (GB). y-axis: CDF. Panels: Native (25k demand sizes/1,969,716 demands), with an inset zoom on 0 to 200KB; Azure (14 VM sizes/2,013,767 VMs); Private clouds (201 VM sizes/301,440 VMs).]
Figure 2. CDF of memory demand sizes in different datacenter types.
As a consequence, no public cloud supports it. Private cloud
providers either do not support it (Nutanix), disable it by default
(VMware TPS, Hyper-V dynamic memory) or enable
it with extra warnings (Red Hat with KVM).
5 Compromis: A DS based memory
virtualization approach for VMs
Compromis is a hardware memory virtualization solution im-
plemented within the MMU that exploits the strengths of
both direct segment (DS) and paging. The former is used
by the hypervisor to deal with VMs while the latter is used
by the guest OS to deal with processes. The innovation is the
utilization of DS instead of paging by the hypervisor. Con-
sidering the fact that it may be impossible to satisfy a VM
demand using a single memory segment, Compromis gener-
alizes DS to DS-n. In the latter, a VM which is allocated k
segments, with 1 ≤ k ≤ n, uses the Compromis hardware
feature. This section presents the set of improvements that
should be applied to the datacenter stack in order to make
Compromis effective.
5.1 General overview
Figure 3 presents the general operations of a datacenter using
Compromis. When a user requests a VM instantiation from
a flavor (#CPU, memory size), the cloud scheduler chooses
the physical machine that will host the VM instance. This
choice is made according to a placement policy, which gen-
erally takes into account the resource availability constraints.
In a Compromis aware datacenter, this policy is extended to
choose the machine with the greatest chance of allocating
large memory segments to the VM. To this end, the cloud
scheduler quickly simulates the execution of the memory al-
locator implemented by the hypervisor of compute nodes.
This simulation is built by the cloud scheduler on top of the
current state of the memory layout of every machine that is
locally stored and periodically updated (see Section 5.4).
When the hypervisor of the selected physical machine re-
ceives the VM instantiation request, it reserves the memory
for the VM in the form of large memory segments rather than
small page chunks as it is currently done (see Section 5.3). If
the number of segments used to satisfy the VM is less than
or equal to n, then the hypervisor configures the VM in DS-n
mode, a new mode that the hardware implements (see Section
5.5.1). Otherwise, the VM is configured in shadow or
EPT mode, depending on the datacenter operator. In DS-n
mode, the hardware performs an address translation by doing
a 1D page walk (instead of 2D) followed by a series of
register to register operations (see Section 5.2). Notice that
a Compromis aware machine can simultaneously run DS-n
and non-DS-n VMs. The next subsections detail the modifications
that should be applied to each datacenter layer for
building Compromis.
Figure 3. General functioning of a datacenter which implements Compromis.
5.2 Hardware level contribution
A hardware which implements Compromis includes new reg-
isters to indicate the mapping of GPA segments (in the guest
address space) to HPA segments (in the host address space).
The value of each register comes from a Virtual Machine
Control Structure (VMCS), configured by the hypervisor at
VM startup (see Section 5.3). The number of added registers
is a function of n: n − 1 guest base registers
(noted $GBReg_1, \ldots, GBReg_{n-1}$; there are no such registers in DS-1),
n host base registers (noted $HBReg_0, \ldots, HBReg_{n-1}$), and
the limit register. These registers indicate the mapping as
follows. The GPA segment $[0, GBReg_1 - 1]$ is mapped to
the HPA segment $[HBReg_0, HBReg_0 + GBReg_1 - 1]$. The GPA segment
$[GBReg_{i-1}, GBReg_i - 1]$ is mapped to the HPA segment
$[HBReg_{i-1}, HBReg_{i-1} + (GBReg_i - GBReg_{i-1}) - 1]$, where
$GBReg_i - GBReg_{i-1}$ is the size of this segment. For a VM
with k segments, the last GPA segment
$[GBReg_{k-1}, GBReg_{k-1} + (limit - HBReg_{k-1})]$ is mapped to the HPA segment
$[HBReg_{k-1}, limit]$. Once the registers have been configured by the
hypervisor, the translation of a virtual address va to the corresponding
HPA hpa for a DS-n VM (whose number of
segments is at most n) is summarized in Figure 4. First,
the MMU performs a 1D GPT walk, taking va as input. This
operation returns a GPA gpa. Then hpa is calculated as follows:

$$hpa = HBReg_i + (gpa - GBReg_i) \qquad (1)$$

with $[GBReg_i, *]$ being the smallest GPA segment which
contains gpa (taking $GBReg_0 = 0$, since the first GPA segment starts at address 0).
If no such segment exists, a boundary violation
is raised and traps into the hypervisor as a "DS-n violation"
exception. More generally, for each gpa extracted from a
GPT level, an offset addition followed by a comparison is
performed, meaning that every EPT walk is replaced by these
two operations. For instance, when the VM has only one segment,
the computation of hpa is as follows:

$$hpa = HBReg_0 + gpa \qquad (2)$$

There is a boundary violation here if hpa is greater than limit.
The performance benefit of these operations against the 2D
page walk done in EPT is discussed in Section 6.
Figure 4. Address translation handling in two DS-n machine types (top: DS-1; bottom: DS-4).
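To make the GPA to HPA step concrete, the following software sketch mimics Equations (1) and (2). It is only a model of what the hardware would compute, under our own assumptions: the guest base list is passed with an explicit leading 0 for the first segment, only the k segments actually used are given, and the names translate and DSnViolation are hypothetical.

```python
import bisect


class DSnViolation(Exception):
    """Models the "DS-n violation" trap into the hypervisor (hypothetical name)."""


def translate(gpa: int, gbreg: list, hbreg: list, limit: int) -> int:
    """Translate a GPA into an HPA for a VM backed by k = len(hbreg) segments.

    gbreg: guest base of each segment, ascending, with gbreg[0] == 0 (assumption).
    hbreg: host base of each segment, in the same order.
    limit: last valid HPA of the last segment.
    """
    # Index of the segment with the largest guest base <= gpa.
    i = bisect.bisect_right(gbreg, gpa) - 1
    if i < 0:
        raise DSnViolation(hex(gpa))
    # Equation (1): offset addition.
    hpa = hbreg[i] + (gpa - gbreg[i])
    # Comparison: only the last segment can overflow, since any gpa beyond
    # an inner segment falls into the next one by construction.
    if i == len(gbreg) - 1 and hpa > limit:
        raise DSnViolation(hex(gpa))
    return hpa


# Two 4KB segments: GPA [0x0, 0xFFF] -> HPA 0x8000.., GPA [0x1000, 0x1FFF] -> HPA 0x20000..
gbreg, hbreg, limit = [0x0, 0x1000], [0x8000, 0x20000], 0x20FFF
print(hex(translate(0x10, gbreg, hbreg, limit)))
print(hex(translate(0x1004, gbreg, hbreg, limit)))
```

With a single segment (gbreg = [0]), the function reduces to Equation (2) followed by the comparison against limit.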
5.3 Hypervisor level contribution
The hypervisor needs two main changes: the integration of
a memory allocator for VMs and the configuration of the
VMCS to indicate DS-n type VMs.
5.3.1 A DS based memory allocator for VMs
We assume that the physical memory is organized in two
parts: the first part is reserved for hypervisor and privileged
VM tasks while the second part is dedicated to user VMs.
This memory organization is found in almost all popular hy-
pervisors. In Compromis, the first memory part is managed
using the traditional memory allocator. Concerning the sec-
ond memory part, a new allocation algorithm is used to en-
force large memory segment allocation to VMs. This section
describes this new allocator.
Implementing a memory allocator requires answering three
questions: (Q1) which data structure to use for storing information
about free memory segments? (Q2) how do we
choose elements from this data structure for responding to
an allocation request? (Q3) how do we insert an element into
this data structure when there is a memory release?
Answer to Q1: data structure. We use a doubly linked
list to describe free memory segments. Each element in the
list describes a segment using three variables:
• base: start address of the segment;
• limit: end address of the segment;
• date: allocation date of the segment containing address base − 1.
The elements of the list are sorted in ascending order of
base. Hereafter, an element of the list is noted [base, limit, date].
Answer to Q2: allocation policy. When the hypervisor
receives a request to start a VM with a memory demand M,
it goes through the list described above to find out which
segments should be allocated to the VM. If it finds a segment
of size M, then that segment is taken off the list and
allocated to the VM. Otherwise, the allocator chooses the
largest segment Sb = [base, limit, date] among the segments whose
size is greater than M. The VM is satisfied with a portion of
Sb. Note that taking the largest segment prevents the proliferation
of small segments, which are bad for a DS based
approach. If there is no segment larger than M, then two options
are possible. The first option (Opt1) satisfies the VM
with the smaller segments. This gives the VMs which
arrive later a chance to obtain the big free segments.
The second option (Opt2) chooses the largest segment that
exists and executes the above algorithm with a new memory
size M′, where M′ equals M minus the size of the chosen segment.
Compromis offers these two options to support various
workload patterns and datacenter constitution. A workload
pattern is the set of VM instantiation and shutdown requests
submitted to the datacenter during a period of time. The con-
stitution of a datacenter is the physical machine sizes. The
option selection is the responsibility of the cloud scheduler,
which has a global view of the datacenter (see Section 5.4).
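The allocation policy can be sketched as follows. This is a simplified model under our own assumptions, not the prototype's code: the free list is a Python list instead of a doubly linked list, the date field is omitted, a partially used segment is consumed from its low addresses, and Opt1 is interpreted as taking the smallest free segments first. The names Segment and allocate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base: int   # first address of the free segment
    limit: int  # last address of the free segment (inclusive)

    @property
    def size(self) -> int:
        return self.limit - self.base + 1


def allocate(free: list, demand: int, option: str = "opt2") -> list:
    """Carve `demand` bytes out of `free` (sorted by base, modified in place).

    Returns the (base, limit) pairs given to the VM.
    """
    if demand > sum(s.size for s in free):
        raise MemoryError("not enough free memory")
    granted = []
    while demand > 0:
        exact = next((s for s in free if s.size == demand), None)
        larger = [s for s in free if s.size > demand]
        if exact is not None:
            chosen, take = exact, demand            # perfect fit
        elif larger:
            chosen = max(larger, key=lambda s: s.size)
            take = demand                           # portion of the largest segment
        elif option == "opt1":
            chosen = min(free, key=lambda s: s.size)
            take = chosen.size                      # assumption: smallest first
        else:
            chosen = max(free, key=lambda s: s.size)
            take = chosen.size                      # Opt2: largest, then recurse on M'
        granted.append((chosen.base, chosen.base + take - 1))
        if take == chosen.size:
            free.remove(chosen)
        else:
            chosen.base += take                     # keep the remainder in the list
        demand -= take
    return granted


free = [Segment(0, 99), Segment(200, 499), Segment(1000, 1049)]
print(allocate(free, 60))  # served from the largest segment: [(200, 259)]
```

The length of the returned list is the number k of segments of the VM, which the hypervisor then compares with n to decide whether the VM can run in DS-n mode.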
Answer to Q3: handling freed memory. Stopping
a VM frees its memory, which has to
be inserted into the list of free memory segments. Let S be
a memory segment to insert into the list. The insertion is as
follows. If S coincides with the beginning or the end of a
segment S ′ in the list, then S ′ is simply extended (forward or
backward). If this extension causes the new big segment to
coincide with the beginning or the end of the segments that
follow or precede it, then the extension continues. If there is
no border coincidence between S and the existing segments
in the list, then S is inserted in the list so that the ascending
order is respected.
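The insertion of a freed segment can be sketched as below. This is a minimal model under our own assumptions: the free list is a Python list of [base, limit] pairs (inclusive bounds) kept sorted by base, the date field is omitted, and the function name release is hypothetical.

```python
import bisect


def release(free: list, base: int, limit: int) -> list:
    """Insert the freed segment [base, limit] into `free` and merge neighbours.

    `free` is a list of [base, limit] pairs sorted by base, modified in place.
    """
    # Position that keeps the list in ascending order of base.
    i = bisect.bisect_left([seg[0] for seg in free], base)
    free.insert(i, [base, limit])
    # Extend forward while the next free segment starts right after this one.
    while i + 1 < len(free) and free[i][1] + 1 == free[i + 1][0]:
        free[i][1] = free[i + 1][1]
        del free[i + 1]
    # Extend backward while the previous free segment ends right before this one.
    while i > 0 and free[i - 1][1] + 1 == free[i][0]:
        free[i - 1][1] = free[i][1]
        del free[i]
        i -= 1
    return free


print(release([[0, 99], [300, 399]], 100, 299))  # [[0, 399]]
print(release([[0, 99]], 200, 299))              # [[0, 99], [200, 299]]
```

Merging on both sides is what lets large segments reappear after VM shutdowns, which the allocation policy relies on.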
5.3.2 VM type configuration
Let k be the number of memory segments allocated to a
VM. If k ≤ n then the VM is of type DS-n, otherwise it
is configured with EPT or shadow paging according to the
datacenter administrator choice. The type configuration of
a VM is done by modifying the VMCS of its vCPUs. To
indicate that a VM is of type DS-n, a new bit of the Secondary
Processor-Based VM-Execution Controls is set. Otherwise,
this bit remains at zero. For DS-n VMs, the hypervisor
also sets new VMCS fields that will be used to populate
the GBReg, HBReg, and limit registers. The fields are populated
in ascending order of crossing segments. The value
of the fields which map to HBReg and limit registers comes
from the list of segments allocated to the VM. Concerning
the fields which map to GBReg registers, their values are cal-
culated as they are filled. When k < n, the remaining fields
are set to zero.
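The derivation of the register values from the list of allocated segments can be sketched as follows. This is an illustration under our own assumptions, not the VMCS layout of the prototype: segments are given as (base, limit) host address pairs in the order in which they back the guest physical address space, and the function name ds_registers is hypothetical.

```python
def ds_registers(segments: list, n: int):
    """Compute (GBReg_1..GBReg_{n-1}, HBReg_0..HBReg_{n-1}, limit) for a DS-n VM.

    segments: (host_base, host_limit) pairs, inclusive bounds, in guest order.
    """
    k = len(segments)
    if not 1 <= k <= n:
        raise ValueError("VM cannot be configured in DS-n mode")
    # Host bases come directly from the allocated segments.
    hbreg = [base for base, _ in segments]
    # Guest bases are accumulated as the fields are filled: each segment
    # starts in the guest where the previous one ends.
    gbreg, next_gpa = [], 0
    for base, limit in segments[:-1]:
        next_gpa += limit - base + 1
        gbreg.append(next_gpa)
    # The limit register is the last host address of the last segment.
    limit_reg = segments[-1][1]
    # When k < n, the remaining fields are set to zero.
    hbreg += [0] * (n - k)
    gbreg += [0] * (n - 1 - len(gbreg))
    return gbreg, hbreg, limit_reg


print(ds_registers([(0x8000, 0x8FFF), (0x20000, 0x20FFF)], n=4))
```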
5.4 Cloud scheduler level contribution
The cloud scheduler is improved for two purposes: a DS-n
aware VM placement algorithm and memory allocation op-
tion selection.
VM placement algorithm improvement. The placement
algorithm determines the physical machine that will instan-
tiate the VM. Traditionally, it has an objective such as load
balancing. For example, the schedulers of OpenStack [23]
and CloudStack [20] consist of a list of filters. Each filter
implements a concern such as resource-matchmaking [36],
VM-VM or VM-host (anti-)affinities. A filter receives as arguments
the set of candidate machines for hosting the VM and
removes from it those which do not satisfy its concern. Each
time the scheduler is invoked to decide where to place a VM,
it chains the filters to eventually retrieve the satisfying machines
and picks one among them.
To take advantage of DS-n, the VM scheduler must integrate
into its objective the maximization of the number
of VMs of type DS-n. For a filter-based scheduler, this consists
in implementing a new filter and appending it to the existing list.
This filter maintains a local copy of the free memory segments
of every machine and uses a simulator to evaluate the
number of segments that would be used if the VM were instantiated
on each machine. It then selects the machine leading
to the smallest number of segments. For schedulers that do not
rely on filters, the rationale is to weigh the existing objective
against the one that consists in picking the machine minimizing
the number of memory segments. Note that this cloud
scheduler modification does not affect (reduce) the hosting
capacity of the datacenter because the destination machine is
selected among the original cloud scheduler candidates.
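A minimal sketch of such a filter is given below. It is not Nova code: the function names are hypothetical, each candidate machine is represented only by the sizes of its free segments, and the per-machine simulator is replaced by a simple stand-in that consumes the largest free segments first.

```python
def segments_needed(free_sizes, demand):
    """Stand-in simulator: segments used when taking largest first.

    Returns None when the machine cannot satisfy the demand.
    """
    used = 0
    for size in sorted(free_sizes, reverse=True):
        if demand <= 0:
            break
        demand -= size
        used += 1
    return used if demand <= 0 else None

def ds_filter(candidates, demand):
    """Pick, among the candidates left by the previous filters,
    the machine leading to the smallest number of segments.

    candidates: dict mapping host name -> list of free segment sizes.
    """
    best_host, best_count = None, None
    for host, free_sizes in candidates.items():
        count = segments_needed(free_sizes, demand)
        if count is not None and (best_count is None
                                  or count < best_count):
            best_host, best_count = host, count
    return best_host, best_count
```

Because the filter only chooses among machines already accepted by the earlier filters, it cannot reject a VM that the original scheduler would have placed.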
Memory allocation option selection. Section 5.3.1 re-
ported that the cloud scheduler has the responsibility to se-
lect the memory allocation option that all hypervisors will
use. To this end, it embeds a memory allocator simulator
which implements the two options presented in 5.3.1. Then
it periodically (e.g., every week) replays in the simulator the
recorded VM startup and shutdown logs. This is done while
varying the memory allocation option. The selected option
is the one that produces the largest number of DS-n VMs. All
hypervisors are then notified with the name of the selected
option and the logs repository is reset.
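The periodic selection can be sketched as a replay loop. The two options of Section 5.3.1 are not reproduced here; the sketch uses first-fit and best-fit purely as stand-ins, models a single machine, and assumes a log made of ("start", vm, size) and ("stop", vm, 0) tuples. All names are hypothetical.

```python
def coalesce(segs):
    """Merge adjacent (start, size) segments of a free list."""
    merged = []
    for start, size in sorted(segs):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + size)
        else:
            merged.append((start, size))
    return merged

def replay(log, capacity, pick):
    """Replay the log; return the segment count of each started VM."""
    free, owned, counts = [(0, capacity)], {}, {}
    for event, vm, size in log:
        if event == "start":
            got = []
            while size > 0 and free:
                fits = [s for s in free if s[1] >= size]
                # Use the option when one segment suffices, else
                # consume the largest free segment and continue.
                seg = pick(fits) if fits else max(free, key=lambda s: s[1])
                free.remove(seg)
                take = min(size, seg[1])
                got.append((seg[0], take))
                if seg[1] > take:
                    free.append((seg[0] + take, seg[1] - take))
                size -= take
            owned[vm], counts[vm] = got, len(got)
        else:
            free.extend(owned.pop(vm))
        free = coalesce(free)
    return counts

def select_option(log, capacity, options, n=1):
    """Return the option producing the most DS-n VMs, and all scores."""
    scores = {name: sum(1 for c in replay(log, capacity, pick).values()
                        if c <= n)
              for name, pick in options.items()}
    return max(scores, key=scores.get), scores

OPTIONS = {
    "first_fit": lambda fits: min(fits),                      # lowest address
    "best_fit": lambda fits: min(fits, key=lambda s: (s[1], s[0])),
}
```

On a log such as start A(3), B(2), C(3), stop A, start D(2), E(3) with a capacity of 10, first-fit places D in the hole left by A and forces E onto two segments, while best-fit keeps that hole intact for E.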
5.5 Prototype
We implemented Compromis in two popular hypervisors (Xen
and KVM), as well as in OpenStack’s Nova scheduler.
5.5.1 Implementation in the hypervisor
Implementation in Xen. The implementation of Compromis
in Xen is straightforward. First, Xen already organizes
the main memory in two parts, as we wish. The first part is
managed by the Linux memory allocator subsystem hosted
within the privileged VM (dom0). The memory allocator for
user VMs, which manages the second part, resides in the
hypervisor core. It is invoked by dom0 during the VM
instantiation process. We simply replaced this allocator
with the one described in Section 5.3.1. We validated the
effectiveness of this algorithm by starting VMs (in
hardware-assisted virtualization (HVM) mode) with single
segments, while the hypervisor still uses EPT for address
translation.
Concerning the configuration of the VM type, the modi-
fication of Xen does not require any particular description
other than what has been said in Section 5.3.2. Concerning
the handling of cloud scheduler notifications related to a
change of the memory allocation option, we define a new
hypercall that informs the hypervisor of the name of the
selected option.
Implementation in KVM. Unlike Xen, KVM does not
hold memory in two blocks. KVM relies on the Linux mem-
ory allocator which sees VMs as normal processes. To im-
plement Compromis in KVM, we first enforce the organiza-
tion of the physical memory in two blocks. To this end, we
use the cgroup mechanism. Then the default Linux memory
allocator is associated with the first block while our
memory allocator manages the second block. The /proc file
system is used to record the memory allocation option
imposed by the cloud scheduler.
5.5.2 Implementation in the cloud scheduler
The implementation of Compromis in OpenStack Nova is
quite straightforward because Nova’s placement algorithm
and its execution steps are easy to identify, which makes
extending it with a simulation of our memory allocator very
simple. Concerning the periodic selection of the memory
allocation option, we implemented a separate process which
starts at the same time as Nova. That process relies on
existing OpenStack logs to obtain VM startup and shutdown
requests.
5.6 Discussion
Memory overcommitment. Since Compromis allows DS-n,
memory overcommitment can be implemented by dynamically
resizing, adding, or removing segments, combined with slight
cooperation between the guest OS and the hypervisor. A VM
which needs more memory gains new segments or sees its
segments extended. Conversely, a VM whose memory needs to be
reduced sees either the size or the number of its segments
reduced. The cooperation between the guest OS and the
hypervisor is only necessary in the latter case: the
hypervisor should indicate to the guest OS the range of GPAs
that should be released by the VM (using the balloon driver
mechanism), because the hypervisor is the only component
which knows the segment ranges.
Memory Mapped IO (MMIO) region virtualization. IO
device emulation and direct IO are the two IO virtualization
solutions implemented by hypervisors in HVM mode. The
former solution, which is the most popular one, consists in
protecting virtual MMIO ranges seen by the guest OS so that
all IO operations performed by the guest trap in the hyper-
visor. With this IO virtualization solution, the utilization of
Compromis is straightforward since virtual MMIO regions
are at the GPA layer. The validation step presented in Sec-
tion 5.5.1 was performed under this solution. With direct IO
virtualization, the guest OS is directly presented the physical
MMIO ranges configured by the hardware device. This so-
lution requires Compromis to use several memory segments.
Note that this solution is not popular in today’s clouds
because it limits scalability (it enables only very few
virtual devices) and dynamic consolidation (VM live
migration is not possible).
6 Evaluations
We evaluated the following aspects: (1) Effectiveness (see
Section 6.1): the capability to start a large number of
VMs using the DS-n technology; (2) Performance gain (see
Section 6.2): the capability to improve the performance of
applications which run in DS-n VMs; (3) Startup impact (see
Section 6.3): the potential positive or negative impact on
VM startup latency. Unless otherwise indicated, the
hypervisor and cloud management system used are Xen and
OpenStack respectively.
6.1 Effectiveness
Effectiveness evaluation is done by simulation using real dat-
acenter traces.
6.1.1 Methodology
We developed a simulator which mimics a datacenter man-
aged with OpenStack [23], improved with our contributions.
The simulator replays VM startup and shutdown requests col-
lected from several production datacenters, presented in Sec-
tion 6.1.2. It considers that a VM demand includes a number
of CPU cores and a memory size. For each simulated VM
startup request, the simulator logs two metrics: the number
of segments used for satisfying the VM memory demand and
the time taken by our changes (extension of the cloud sched-
uler and the utilization of our memory allocation algorithm
in the hypervisor).
To highlight the benefits of each Compromis feature, we
evaluated different versions including:
• BaseLine: the simulator implements both the native
OpenStack scheduler and Xen’s memory allocation
algorithms;
• ImprovPlacement: in this version the VM placement
algorithm is improved to choose for every VM the
machine which will use the minimum number of memory
segments (as described in Section 5.3.1);
• DynamicOptionSelec: in this version the cloud sched-
uler calculates every week the best memory alloca-
tion option which will be used (as described in Sec-
tion 5.4).
6.1.2 Datasets
We used the traces of 2 public clouds (Bitbrains [39] and
Microsoft Azure [21]) and 308 private clouds. Among other
fields, each trace includes: the VM creation and destruction
time, and the VM size (#CPU and memory size).
Bitbrains. This cloud is a service provider specialized
in managed hosting and business computation for many en-
terprises. The dataset consists of 1,750 VMs, collected be-
tween August and September 2013. Bitbrains does not in-
clude physical machine characteristics.
Azure. This is a public Microsoft cloud. The dataset
comprises 2,013,767 VMs running on Azure from November
16th, 2016 to February 16th, 2017.
Private clouds. This group aggregates data of 308 private
IaaS clouds running diverse workloads from November 1st,
2018 to November 29th, 2018. For a given cloud, we
collected one or more consistent snapshots of the cluster state at
the moment the cluster triggered its hotspot mitigation ser-
vice, which indicates that a machine is getting close to sat-
uration. A snapshot depicts the running VMs, their sizing
(in terms of memory and cores) and their host (in terms of
available memory and cores). The collected dataset includes
301,440 VMs. As the dataset contains snapshots and not
the VM creation and destruction time, we derived from each
snapshot a bootstorm scenario where all the VMs are created
simultaneously. This dataset includes server characteristics.
Composition and server characteristics used for Bitbrains
and Azure. Having no hardware information about the first
two datasets, we consider that they are composed of server
generations presented in Table 4. We chose these server
generations as they are used in Azure according to this
YouTube video [2]. Gen6 and Godzilla are new generations
while Gen2 HPC, Gen4 and Gen5 are older ones. All server
generations are present in the same proportion.
6.1.3 Results
Name      RAM (GB)  Cores  % in the traces
HPC       128       24     20
Gen4      192       24     20
Gen5      256       40     20
Gen6      192       48     20
Godzilla  512       32     20
Table 4. Server generations used in the replay of Bitbrains
and Azure traces.

Bitbrain
Solution               1 seg.   2 seg.    3 seg.  >3 seg.
BaseLine               12.816   44.376    30.078  12.728
ImprovPlacement+Opt1   100      0         0       0
ImprovPlacement+Opt2   100      0         0       0
DynamicOptionSelec     100      0         0       0
Azure
Solution               1 seg.   2 seg.    3 seg.  >3 seg.
BaseLine               3.581    11.171    9.996   75.252
ImprovPlacement+Opt1   99.9736  0.026     0       6.18E-05
ImprovPlacement+Opt2   99.947   0.007     0.022   0.021
DynamicOptionSelec     99.999   7.07E-04  0       0
Table 5. Number of memory segments allocated to VMs
from Bitbrain and Azure (% of VMs).

Bitbrain and Azure - Table 5. BaseLine provides better
results in Bitbrain (up to 81% of VMs are satisfied with
fewer than four memory segments) compared to Azure (only
about 24% of VMs are satisfied with fewer than four memory
segments). This is because the VMs running on Bitbrain
have a longer lifetime than those running on Azure. However,
our solutions satisfy more VMs than BaseLine (99.95%-100%).
This is because BaseLine, which implements Xen’s allocator,
organizes the physical memory in the form of small memory
chunks which are then used for allocation. As a naive
algorithm, Xen cannot enforce DS for a VM even if there
exists a free memory segment which is larger than the memory
demand. In contrast, Compromis enforces DS almost perfectly
(more than 99% of VMs are satisfied with only one memory
segment). Our two
memory allocation options discussed in 5.3.1 show a slight
difference in Azure: ImprovPlacement+Opt1 satisfies more
VMs with only one segment than ImprovPlacement+Opt2.
Finally, dynamically switching between the two options
(DynamicOptionSelec) is the best solution (99.99% of VMs
use one memory segment).
Private clouds - Figure 5. We plot the results for these
clouds separately from the previous ones because of the
large number of clouds. We can make the same observation
as above: our solutions satisfy almost all VMs with only one
memory segment, which appears as a wall at 1 on the latitude
axis.
6.2 Performance gain
This section evaluates the performance gain brought by the
utilization of DS-n.
[Figure 5: four panels, one per solution (DynamicOptionSelec,
ImprovPlacement+Opt2, ImprovPlacement+Opt1, BaseLine);
#segment axis from 1 to 4, proportion axis from 0 to 100.]
Figure 5. Number of memory segments allocated
to VMs from 308 private clouds (longitude=cloud,
latitude=#segment, depth=proportion).
6.2.1 Methodology
A DS-n machine handles a TLB miss using a 1D page walk
followed by a set of register to register operations. We
mimic this behavior by running the VM in para-virtualized
(noted PV) mode [11], which also uses a 1D page walk.
However, in PV, all page table modifications performed by
the VM kernel trap into the hypervisor using hypercalls. We
modified the guest kernel to directly set page table entries
with the correct HPAs, calculated in the same way as DS-n
hardware would have done. The reader could legitimately ask
why we use PV to simulate a hardware-assisted solution. We
claim that our approach makes sense in our context because
the benchmarks do not exercise the PV machinery: all disks
are in-memory (tmpfs) based and all network requests use the
loopback interface. Accordingly, only the memory subsystem
is exercised.
The evaluation methodology we use is as follows. Let
$T_{1D}$ be the execution time of the VM in this modified PV
context. We estimate the cost (noted $T^{n}_{reg2reg}$) of
the register to register operations performed by the DS-n
hardware on a TLB miss using assembly code which executes
those operations and which is adapted to the value of n.
Let $N_{tlb}$ be the number of TLB misses (collected using
PMC) generated by the application when it is executed in a
native system. We estimate the execution time $T_{DS-n}$ of
a VM on a DS-n machine using this formula:
$$T_{DS-n} = T_{1D} + N_{tlb} \times T^{n}_{reg2reg} \quad (3)$$
We evaluated different values of n from 1 to 3. We compare
DS-n with EPT (in which the execution time is noted
$T_{ept}$) and shadow paging (in which the execution time is
noted $T_{sha}$). We used a 4KB page size in guest VMs as it
is the standard size. The characteristics of the
experimental machine are presented in Table 6. Notice that
this machine includes a page walk cache [12]. The benchmarks
we use (the same as in previous work) are listed in Table 1.
Each benchmark runs in a VM having a single vCPU and 5 GB of
memory. The hypervisor and OS used are Xen 4.8 and Ubuntu
16.04 (Linux kernel 4.15) respectively.

Processor  Single socket Intel(R) Core(TM) i7-3768
           @2.40GHz, 4 cores
Memory     16GB DDR4 1600MHz
DTLB       4-way, 64 entries
ITLB       4-way, 128 entries
Table 6. Characteristics of the experimental machine.
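Equation (3) can be evaluated with a few lines of code. The sketch below is only an illustration: the function name is hypothetical and the input values are made up for the example, not measurements from our experiments.

```python
def estimate_t_ds_n(t_1d, n_tlb, t_reg2reg):
    """Equation (3): estimated execution time on DS-n hardware.

    t_1d:      execution time in the modified PV context (seconds)
    n_tlb:     number of TLB misses measured on the native system
    t_reg2reg: cost of the register to register operations
               executed on each TLB miss, for a given n (seconds)
    """
    return t_1d + n_tlb * t_reg2reg

# Illustrative inputs only (assumed values, not measured ones).
t_1d = 100.0
n_tlb = 2_000_000_000
t_reg2reg_by_n = {1: 1e-9, 2: 2e-9, 3: 3e-9}

for n, cost in t_reg2reg_by_n.items():
    print(f"DS-{n}: {estimate_t_ds_n(t_1d, n_tlb, cost):.2f} s")
```

The second term grows linearly with both the number of TLB misses and the per-miss register cost, so it stays small whenever the per-miss cost is a few cycles.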
6.2.2 Results
Figure 6 presents the evaluation results. We only present the
results for DS-1 because we obtained almost the same results
with DS-2 and DS-3. This is because the cost of register
to register operations realized in DS-1, DS-2 and DS-3 is
extremely low compared with the cost of a 2D page walk.
Figure 6 is interpreted as follows. First, as expected,
CPU-intensive-only applications (e.g., hmmer from SPEC
CPU2006) do not benefit much from DS-n. Second, we confirm
that DS-n almost nullifies the overhead of memory
virtualization and
leads the application almost to the same performance as in
native systems. In fact, all black histogram bars are very
close to 1. DS-n outperforms both EPT (up to 30% of perfor-
mance difference for mcf) and shadow paging (up to 370%
of performance difference for Elastic Search). Finally, we
observe that DS-n produces an overhead that is not only very
low, close to zero (0.35%), but also stable (0.42 standard
deviation). While a smaller overhead is always appreciable,
a stable overhead can also be a requirement to host
latency-sensitive applications like databases or real-time
systems.
To explain the origin of this significant performance gap
between these memory virtualization technologies, we
analyzed the values of the internal metrics, focusing on the
applications Redis, gcc, and Elastic Search. For DS-n, the
cost of memory virtualization is
$C_{DS-n} = C_{1D} \times N^{DS-n}_{tlb}$, where $C_{1D}$ is
the number of CPU cycles for performing a 1D page walk and
$N^{DS-n}_{tlb}$ is the number of TLB misses. For EPT, that
cost is $C_{EPT} = C_{2D} \times N^{EPT}_{tlb}$, where
$C_{2D}$ is the number of CPU cycles used to perform a 2D
page walk and $N^{EPT}_{tlb}$ is the number of TLB misses.
For shadow paging, the cost is
$C_{Sha} = C_{1D} \times N^{Sha}_{tlb} + N^{Sha}_{exit} \times (C_{exit} + C_{enter} + C_{handler})$,
where $N^{Sha}_{tlb}$ is the number of TLB misses;
$N^{Sha}_{exit}$ is the number of VMExits related to page
table modification operations; $C_{exit}$ and $C_{enter}$
are the costs of performing a VMExit and the VMEnter that
follows it; and $C_{handler}$ is the average execution time
of memory management handlers in the hypervisor. Table 7
presents the values of these costs, according to our
experimental machine. We observe that $C_{DS-n}$ is much
lower than $C_{EPT}$ (e.g., ×6 for Redis) and $C_{Sha}$
(e.g., ×14 for Elastic Search).
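The three cost models and the ratios derived from Table 7 can be written down as follows. This is a sketch for illustration: the function and variable names are hypothetical, and only the totals reported in Table 7 are used as data.

```python
def cost_ds_n(c_1d, n_tlb):
    """DS-n: one 1D page walk per TLB miss."""
    return c_1d * n_tlb

def cost_ept(c_2d, n_tlb):
    """EPT: one 2D page walk per TLB miss."""
    return c_2d * n_tlb

def cost_sha(c_1d, n_tlb, n_exit, c_exit, c_enter, c_handler):
    """Shadow paging: 1D walks plus the cost of every VMExit
    caused by a page table modification."""
    return c_1d * n_tlb + n_exit * (c_exit + c_enter + c_handler)

# Total costs in seconds, copied from Table 7.
table7 = {
    "Redis":          {"DS-n": 3,  "EPT": 17, "Sha": 25},
    "gcc":            {"DS-n": 13, "EPT": 17, "Sha": 62},
    "Elastic Search": {"DS-n": 14, "EPT": 46, "Sha": 201},
}

# Ratio of each technology's cost to the DS-n cost.
ratios = {app: (round(c["EPT"] / c["DS-n"], 1),
                round(c["Sha"] / c["DS-n"], 1))
          for app, c in table7.items()}

for app, (ept, sha) in ratios.items():
    print(f"{app}: EPT x{ept}, shadow paging x{sha}")
```

The largest EPT ratio is obtained for Redis and the largest shadow paging ratio for Elastic Search.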
[Figure 6: bar chart. Y axis: Overhead (%), from 0 to 80.
X axis: bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum,
h264ref, omnetpp, astar, xalancbmk, blackscholes, bodytrack,
canneal, ferret, fluidanimate, streamcluster, vips, Redis,
Elastic Search. Series: SP g4KB, EPT g4KB-h2MB, DS-n g4KB.]
Figure 6. Performance overhead of DS-n compared with shadow paging (SP) and EPT. Lower is better.
Technology  Redis  gcc  Elastic Search
CDS−n       3      13   14
CEPT        17     17   46
CSha        25     62   201
Table 7. The total cost (in seconds) of each memory
virtualization technology for Redis, gcc and Elastic Search.

Solution            Bitbrain     Azure         Private clouds
BaseLine            6.42-139.47  17.76-520.55  1.92-18.27
DynamicOptionSelec  3.57-1.23    3.42-1.18     0.098-0.011
Table 8. Memory allocation latency (mean-stdev) in ms.
6.3 Startup impact
Recall that Compromis extends the cloud scheduler (which
intervenes at VM startup time) and changes the default
memory allocator used by the hypervisor (also at VM startup
time). Therefore, one may legitimately ask whether these
changes impact the VM startup latency. We answer this
question by summing the cost of the scheduler extension with
the cost of our memory allocation algorithm, then we compare
the sum with the cost of the default Xen memory allocation
algorithm. We rely on simulation logs generated during the
evaluations presented in Section 6.1. The experiment reports
that almost all the different versions of our solution have
the same complexity, thus we only present the results for
DynamicOptionSelec in Table 8. These results are interpreted
as follows. First, we observe that our solution reduces the
startup time, by up to 80% for Azure VMs. This is because
our allocation algorithm is simpler than that of Xen, which
organizes memory in several memory chunk lists and iterates
over these lists several times to satisfy a memory demand.
Second, the smaller standard deviation shows that the
startup time becomes more stable than with Xen. Such
predictability is critical for auto-scaling services, as
demonstrated by Nitu et al. [32]. The unpredictability of
Xen comes from its complex memory allocation algorithm
presented above.
7 Related work
The overhead of memory virtualization in native systems has
been demonstrated by several previous works [12–14, 16–18,
22, 26, 30, 33, 34, 45]. It has also been shown that this
overhead is exacerbated in virtualized environments [6–9,
15, 19, 24, 25, 35, 42, 44, 45]. This section presents
existing work in the latter context. The research in this
domain can be classified into two categories: software and
hardware-assisted solutions.
7.1 Software solutions
Direct paging [5] is similar to shadow paging [43] (presented
in Section 2.1), but it requires the modification of the
guest OS. In direct paging [5], the hypervisor introduces an
additional level of abstraction between what the guest sees
as physical memory and the underlying machine memory. This
is done through the introduction of a Physical to Machine
(P2M) mapping maintained within the hypervisor, as in
shadow paging. The guest OS is aware of the P2M mapping
and is modified such that, instead of writing ordinary PTEs,
it writes entries mapping virtual addresses directly to the
machine address space by using the P2M itself. Like shadow
paging, direct paging uses a 1D page walk to handle a TLB
miss. However, it has two main drawbacks: context switches
between the guest and the hypervisor for building the P2M
table, and the modification of the guest OS (making
proprietary OSes such as Windows not usable).
7.2 Hardware-assisted solutions
Both Intel and AMD proposed EPT [15, 42], a
hardware-assisted solution which does not have the
limitations of software solutions. We have already presented
this solution in Section 2.2. As shown there, EPT is far
from satisfactory
because of the 2D page walk that it imposes. To reduce the
overhead caused by this 2D page walk, several works have
proposed the extension of the page walk cache (PWC) [12],
used in native systems. Such a cache avoids page walk on
PWC hit. Bhargava et al. [15] investigated for the first time
this extension of PWC for EPT. The main limitation of such
solutions is their inefficiency facing large working set size
VMs (e.g., in-memory databases) [45]. Also, PWC based so-
lutions suffer from a high rate of cache misses when several
VMs share the same machine due to cache evictions. Ahn et
al. [8] used a flat EPT instead of the traditional
multi-level radix tree. This way, the authors reduced the
number of memory references on a TLB miss to 9. Compromis
totally eliminates the EPT, resulting in 4 memory references
for each TLB miss.
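These reference counts follow from a simple counting argument, sketched below. The function name is hypothetical, four-level page tables are assumed on both sides, and modelling a flat EPT as a single host level is our simplification: each guest page table entry is located by a guest physical address that must first be translated by the host table, and the final guest physical address needs one more host translation.

```python
def page_walk_refs(guest_levels, host_levels):
    """Memory references needed to serve one TLB miss.

    Each guest level costs `host_levels` references to translate
    the address of the guest entry, plus one reference to read
    that entry; the final guest physical address then needs
    `host_levels` more references.
    """
    return guest_levels * (host_levels + 1) + host_levels

print("2D walk with radix EPT:", page_walk_refs(4, 4))
print("2D walk with flat EPT: ", page_walk_refs(4, 1))
print("1D walk (no EPT):      ", page_walk_refs(4, 0))
```

With no host table to traverse, the count reduces to the number of guest levels, which is the native case.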
Some solutions improved the TLB [35, 37, 45]. Ryoo et al.
[37] presented POM-TLB, a very large level-3 TLB stored in
RAM. POM-TLB brings two main advantages. First, the number
of TLB misses is reduced because of the large TLB size, thus
reducing the number of 2D page walks. Second, POM-TLB
benefits from the data cache to reduce RAM references.
However, on a cache miss a RAM access is necessary. Also, on
a POM-TLB miss, the hardware still performs a 2D page walk.
This solution can be used at the same time as Compromis.
Wang et al. [44] and Gandhi et al. [25] showed that neither
EPT nor shadow paging can be a definite winner. They pro-
posed dynamic switching mechanisms that exceed the ben-
efits of each technique. To this end, TLB misses and guest
page faults are monitored to determine the best technique to
apply. Such dynamic solutions come with a significant
overhead related to two tasks: the monitoring and the
computation of the considered metrics consume a lot of CPU
cycles, and switching from one technique to another requires
rebuilding the page tables.
7.3 Orthogonal solutions
Some researchers like Kwon et al. [28, 29] proposed the
utilization of huge pages [4, 38] in the guest OS and the
hypervisor at the same time. This way, the number of levels
in the page table is reduced, thus the number of memory
references during a page walk is reduced too. However, using
huge pages leads to two main limitations for the guest.
First, it increases memory fragmentation, thus memory waste
for the guest. This could lead to memory pressure in the
guest OS, resulting in swapping, which hurts application
performance. Second, huge pages increase average and tail
memory allocation latency in the guest because zeroing a
huge page at page allocation time is more time consuming
than zeroing a 4KB page.
Talluri et al. [41] proposed Hashed page tables in na-
tive systems as an efficient alternative to the radix page ta-
ble structure. With hashed page tables, address translation
is done using a single memory reference, assuming no col-
lision. Yaniv et al. [45] presented how this technique can
be adapted for virtualized systems. The authors showed that
by using a 2D hashed page table hierarchy, the page walk is
done with 3 memory references instead of 24. This is one
fewer than in Compromis and native systems, but the approach
suffers from hash collisions.
Direct segment (DS) based solutions. Previous work
showed the benefits of DS in both native [13, 27] and
virtualized systems [9, 24, 25]. Alam et al. presented
DVMT [9], a mechanism which allows applications inside the
VM to request DS allocations directly from the hypervisor.
The application is responsible for mapping GVAs which are in
the allocated DS address space. This is a limitation for
application developers who are not experts. Gandhi et al.
[24] proposed three memory virtualization solutions based on
DS. Their VMM Direct mode is very close to Compromis, but DS
does not cover the entire VM memory. In addition, the
authors mainly investigated the two other modes.
More generally, existing solutions in this category mainly
focused on hardware contributions while we study the entire
cloud stack consequences. Also, they relied only on simula-
tions while we tried to perform accurate experiments on real
machines using real systems. Finally, we motivate (relying
on trace analysis) for the first time the relevance of DS for
VMs.
8 Conclusion
This paper presented Compromis, a novel MMU solution for
virtualized systems. Compromis generalizes DS to provide
the entire VM memory space using a minimal number of
memory segments. This way, the hardware page table walker
performs a 1D page walk as in native systems. By analyz-
ing several production datacenter traces, the paper showed
that Compromis provisioned up to 99.99% of VMs with a single
memory segment. The paper presented a systematic
implementation of Compromis in the hardware, the hypervisor and
the cloud scheduler. The evaluation results show that Com-
promis reduces the memory virtualization overhead to only
0.35%. Furthermore, Compromis reduces the VM startup la-
tency by up to 80% while providing also a predictable value.
References
[1] [n.d.]. Benefits of Virtualization.
https://www.thrivenetworks.com/blog/benefits-of-virtualization/ .
[2] [n.d.]. Inside Microsoft Azure datacenter hardware
and software architecture with Mark Russinovich.
https://www.youtube.com/watch?v=Lv8fDiTNHjk.
[3] [n.d.]. Top 5 Business Benefits of Server Virtualization.
https://blog.nhlearningsolutions.com/blog/top-5-ways-businesses-benefit-from-server-virtualization .
[4] [n.d.]. Transparent Hugepages. https://lwn.net/Articles/359158/.
[5] [n.d.]. X86 Paravirtualised Memory Management.
https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management.
[6] Keith Adams and Ole Agesen. 2006. A Comparison of Software and
Hardware Techniques for x86 Virtualization. In Proceedings of the
12th International Conference on Architectural Support for Program-
ming Languages and Operating Systems (ASPLOS XII). ACM, New
York, NY, USA, 2–13. https://doi.org/10.1145/1168857.1168860
[7] Ole Agesen, Jim Mattson, Radu Rugina, and Jeffrey Shel-
don. 2012. Software Techniques for Avoiding Hardware
Virtualization Exits. In Proceedings of the 2012 USENIX
Conference on Annual Technical Conference (USENIX
ATC’12). USENIX Association, Berkeley, CA, USA, 35–35.
http://dl.acm.org/citation.cfm?id=2342821.2342856
[8] Jeongseob Ahn, Seongwook Jin, and Jaehyuk Huh. 2012. Revisiting
Hardware-assisted Page Walks for Virtualized Systems. In Proceed-
ings of the 39th Annual International Symposium on Computer Archi-
tecture (ISCA ’12). IEEE Computer Society, Washington, DC, USA,
476–487. http://dl.acm.org/citation.cfm?id=2337159.2337214
[9] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion.
2017. Do-It-Yourself Virtual Memory Translation. In Proceed-
ings of the 44th Annual International Symposium on Computer
Architecture (ISCA ’17). ACM, New York, NY, USA, 457–468.
https://doi.org/10.1145/3079856.3080209
[10] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph,
Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel
Rabkin, Ion Stoica, and Matei Zaharia. 2010. A View of
Cloud Computing. Commun. ACM 53, 4 (April 2010), 50–58.
https://doi.org/10.1145/1721654.1721672
[11] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen
and the art of virtualization. In SOSP. 164–177.
[12] Thomas W. Barr, Alan L. Cox, and Scott Rixner. 2010.
Translation Caching: Skip, Don’t Walk (the Page Table). In
Proceedings of the 37th Annual International Symposium on
Computer Architecture (ISCA ’10). ACM, New York, NY, USA, 48–59.
https://doi.org/10.1145/1815961.1815970
[13] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and
Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory
Servers. In Proceedings of the 40th Annual International Symposium
on Computer Architecture (ISCA ’13). ACM, New York, NY, USA,
237–248. https://doi.org/10.1145/2485922.2485943
[14] Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2012.
Reducing Memory Reference Energy with Opportunistic
Virtual Caching. In Proceedings of the 39th Annual Inter-
national Symposium on Computer Architecture (ISCA ’12).
IEEE Computer Society, Washington, DC, USA, 297–308.
http://dl.acm.org/citation.cfm?id=2337159.2337194
[15] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha
Manne. 2008. Accelerating Two-dimensional Page Walks for Vir-
tualized Systems. In Proceedings of the 13th International Confer-
ence on Architectural Support for Programming Languages and Op-
erating Systems (ASPLOS XIII). ACM, New York, NY, USA, 26–35.
https://doi.org/10.1145/1346281.1346286
[16] Abhishek Bhattacharjee. 2013. Large-reach Memory Man-
agement Unit Caches. In Proceedings of the 46th An-
nual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO-46). ACM, New York, NY, USA, 383–394.
https://doi.org/10.1145/2540708.2540741
[17] Abhishek Bhattacharjee. 2017. Translation-Triggered Prefetch-
ing. SIGARCH Comput. Archit. News 45, 1 (April 2017), 63–76.
https://doi.org/10.1145/3093337.3037705
[18] Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi.
2011. Shared Last-level TLBs for Chip Multiprocessors.
In Proceedings of the 2011 IEEE 17th International Sym-
posium on High Performance Computer Architecture (HPCA
’11). IEEE Computer Society, Washington, DC, USA, 62–63.
http://dl.acm.org/citation.cfm?id=2014698.2014896
[19] Xiaotao Chang, Hubertus Franke, Yi Ge, Tao Liu, Kun Wang, Jimi
Xenidis, Fei Chen, and Yu Zhang. 2013. Improving Virtualization
in the Presence of Software Managed Translation Lookaside Buffers.
In Proceedings of the 40th Annual International Symposium on Com-
puter Architecture (ISCA ’13). ACM, New York, NY, USA, 120–129.
https://doi.org/10.1145/2485922.2485933
[20] cloudstack [n.d.]. Apache CloudStack – Open Source Cloud Comput-
ing. http://cloudstack.apache.org/.
[21] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Mar-
cus Fontoura, and Ricardo Bianchini. 2017. Resource Central: Under-
standing and Predicting Workloads for Improved Resource Manage-
ment in Large Cloud Platforms. In Proceedings of the 26th Symposium
on Operating Systems Principles (SOSP ’17). ACM, New York, NY,
USA, 153–167. https://doi.org/10.1145/3132747.3132772
[22] Guilherme Cox and Abhishek Bhattacharjee. 2017. Efficient Ad-
dress Translation for Architectures with Multiple Page Sizes. In
Proceedings of the Twenty-Second International Conference on Ar-
chitectural Support for Programming Languages and Operating
Systems (ASPLOS ’17). ACM, New York, NY, USA, 435–448.
https://doi.org/10.1145/3037697.3037704
[23] filter [n.d.]. Nova filter scheduler.
http://docs.openstack.org/developer/nova/filter_scheduler.html .
[24] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M.
Swift. 2014. Efficient Memory Virtualization: Reducing Dimen-
sionality of Nested Page Walks. In Proceedings of the 47th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO-
47). IEEE Computer Society, Washington, DC, USA, 178–189.
https://doi.org/10.1109/MICRO.2014.37
[25] Jayneel Gandhi, Mark D. Hill, and Michael M. Swift. 2016. Ag-
ile Paging: Exceeding the Best of Nested and Shadow Paging. In
Proceedings of the 43rd International Symposium on Computer Ar-
chitecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 707–718.
https://doi.org/10.1109/ISCA.2016.67
[26] Swapnil Haria, Mark D. Hill, and Michael M. Swift. 2018.
Devirtualizing Memory in Heterogeneous Systems. In Proceed-
ings of the Twenty-Third International Conference on Architec-
tural Support for Programming Languages and Operating Sys-
tems (ASPLOS ’18). ACM, New York, NY, USA, 637–650.
https://doi.org/10.1145/3173162.3173194
[27] Nikhita Kunati and Michael M. Swift. 2018. Implementation of
Direct Segments on a RISC-V Processor. In Second Workshop on
Computer Architecture Research with RISC-V (CARRV), Co-located
with ISCA.
[28] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Ross-
bach, and Emmett Witchel. 2016. Coordinated and Efficient
Huge Page Management with Ingens. In Proceedings of the 12th
USENIX Conference on Operating Systems Design and Implementa-
tion (OSDI’16). USENIX Association, Berkeley, CA, USA, 705–721.
http://dl.acm.org/citation.cfm?id=3026877.3026931
[29] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach,
and Emmett Witchel. 2017. Ingens: Huge Page Support for the OS
and Hypervisor. SIGOPS Oper. Syst. Rev. 51, 1 (Sept. 2017), 83–93.
https://doi.org/10.1145/3139645.3139659
[30] Yashwant Marathe, Nagendra Gulur, Jee Ho Ryoo, Shuang Song, and
Lizy K. John. 2017. CSALT: Context Switch Aware Large TLB. In
Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO-50 ’17). ACM, New York, NY, USA,
449–462. https://doi.org/10.1145/3123939.3124549
[31] Vlad Nitu, Aram Kocharyan, Hannas Yaya, Alain Tchana, Daniel Hag-
imont, and Hrachya Astsatryan. 2018. Working Set Size Estimation
Techniques in Virtualized Environments: One Size Does Not Fit All.
Proc. ACM Meas. Anal. Comput. Syst. 2, 1, Article 19 (April 2018),
22 pages. https://doi.org/10.1145/3179422
[32] Vlad Nitu, Pierre Olivier, Alain Tchana, Daniel Chiba, Antonio
Barbalace, Daniel Hagimont, and Binoy Ravindran. 2017. Swift
Birth and Quick Death: Enabling Fast Parallel Guest Boot and
Destruction in the Xen Hypervisor. In Proceedings of the 13th
ACM SIGPLAN/SIGOPS International Conference on Virtual Exe-
cution Environments (VEE ’17). ACM, New York, NY, USA, 1–14.
https://doi.org/10.1145/3050748.3050758
[33] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making
Huge Pages Actually Useful. In Proceedings of the Twenty-Third Inter-
national Conference on Architectural Support for Programming Lan-
guages and Operating Systems (ASPLOS ’18). ACM, New York, NY,
USA, 679–692. https://doi.org/10.1145/3173162.3173203
[34] Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh.
2017. Hybrid TLB Coalescing: Improving TLB Translation Cov-
erage Under Diverse Fragmented Memory Allocations. In Proceed-
ings of the 44th Annual International Symposium on Computer
Architecture (ISCA ’17). ACM, New York, NY, USA, 444–456.
https://doi.org/10.1145/3079856.3080217
[35] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhat-
tacharjee. 2015. Large Pages and Lightweight Memory Man-
agement in Virtualized Environments: Can You Have It Both
Ways? In Proceedings of the 48th International Symposium on
Microarchitecture (MICRO-48). ACM, New York, NY, USA, 1–12.
https://doi.org/10.1145/2830772.2830773
[36] R. Raman, M. Livny, and M. Solomon. 1998. Matchmak-
ing: Distributed Resource Management for High Throughput
Computing. In Proceedings of the 7th IEEE International Sym-
posium on High Performance Distributed Computing (HPDC
’98). IEEE Computer Society, Washington, DC, USA, 140–.
http://dl.acm.org/citation.cfm?id=822083.823222
[37] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K.
John. 2017. Rethinking TLB Designs in Virtualized Environ-
ments: A Very Large Part-of-Memory TLB. In Proceedings
of the 44th Annual International Symposium on Computer Ar-
chitecture (ISCA ’17). ACM, New York, NY, USA, 469–480.
https://doi.org/10.1145/3079856.3080210
[38] Tom Shanley. 1996. Pentium Pro Processor System Architecture (1st
ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA.
[39] Siqi Shen, Vincent van Beek, and Alexandru Iosup. 2015. Statisti-
cal Characterization of Business-Critical Workloads Hosted in Cloud
Datacenters. In 15th IEEE/ACM International Symposium on Cluster,
Cloud and Grid Computing, CCGrid 2015, Shenzhen, China, May 4-7,
2015. 465–474.
[40] Cristan Szmajda and Gernot Heiser. 2003. Variable Radix Page Table:
A Page Table for Modern Architectures. In Advances in Computer
Systems Architecture, Amos Omondi and Stanislav Sedukhin (Eds.).
Springer Berlin Heidelberg, Berlin, Heidelberg, 290–304.
[41] M. Talluri, M. D. Hill, and Y. A. Khalidi. 1995. A New Page Table for
64-bit Address Spaces. In Proceedings of the Fifteenth ACM Sympo-
sium on Operating Systems Principles (SOSP ’95). ACM, New York,
NY, USA, 184–200. https://doi.org/10.1145/224056.224071
[42] Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L. Santoni, Fer-
nando C. M. Martins, Andrew V. Anderson, Steven M. Bennett,
Alain Kagi, Felix H. Leung, and Larry Smith. 2005. Intel Vir-
tualization Technology. Computer 38, 5 (May 2005), 48–56.
https://doi.org/10.1109/MC.2005.163
[43] Carl A. Waldspurger. 2002. Memory Resource Management in
VMware ESX Server. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002),
181–194. https://doi.org/10.1145/844128.844146
[44] Xiaolin Wang, Jiarui Zang, Zhenlin Wang, Yingwei Luo, and Xiaom-
ing Li. 2011. Selective Hardware/Software Memory Virtualization. In
Proceedings of the 7th ACM SIGPLAN/SIGOPS International Confer-
ence on Virtual Execution Environments (VEE ’11). ACM, New York,
NY, USA, 217–226. https://doi.org/10.1145/1952682.1952710
[45] Idan Yaniv and Dan Tsafrir. 2016. Hash, Don’t Cache (the Page
Table). In Proceedings of the 2016 ACM SIGMETRICS Interna-
tional Conference on Measurement and Modeling of Computer Sci-
ence (SIGMETRICS ’16). ACM, New York, NY, USA, 337–350.
https://doi.org/10.1145/2896377.2901456
[Figure: four panels showing the CDF of the number of memory segments for the Bitbrain, CERIT, Azure, and XXXXXX traces. X-axis: # Memory segments (0 to 12000); y-axis: CDF (0 to 1). Curves in each panel: BaseLine, ImprovPlacement+Opt1, ImprovPlacement+Opt2, DynamicOptionSelec.]
