Tip of the Iceberg: Low-Associativity Paging by Mukherjee, Nirjhar
Honors Thesis
Tip of the Iceberg: Low-Associativity Paging
Nirjhar Mukherjee
Department of Computer Science








Virtual address translation is a growing bottleneck for large data appli-
cations — such as machine learning and graph analytics — as each virtual
address translation requires multiple, slow, memory accesses. TLBs at-
tempt to solve this problem, but have limited coverage and cannot keep
up with the rate at which virtual memory scales.
This paper argues that a path forward is to reconsider the necessity of
fully associative page translations, where any virtual page can map to any
physical location. Reduced associativity, as in hashed page tables, can im-
prove TLB coverage by reducing the number of bits needed to cache one
translation. Unfortunately, reduced associativity is often implemented us-
ing techniques that cause usability problems that stem from associativity
conflicts.
We summarize a new hashing scheme, iceberg hashing, that is suit-
able for virtual memory, as it successfully addresses flaws in prior hashing
schemes.
This thesis describes how iceberg hashing can be implemented and
used in a virtual memory system. Finally, the thesis presents a study and
preliminary data using xv6, which indicates iceberg hashing can be config-
ured such that associativity conflicts only manifest when memory is more




2 State of the Art
2.1 Single
2.2 Greedy[d]
2.3 Left[d]
2.4 Cuckoo
3 Analysis
4 Iceberg Hashing
4.1 The General Case
5 Implementation Details
5.1 Boot
5.2 Free Page List
5.3 Page Allocation
5.3.1 Userspace Page Allocation
5.4 Page Free
6 Page Tables
7 Prototype and Experiments
7.1 Xv6 Prototype and States
7.2 Initial State
7.2.1 Experiments, Tuning, and Results
7.2.2 The Early Eviction Problem and 5% LRU
7.3 Steady State
7.3.1 LRU Designs
8 Current Research and Related Work
8.1 Linux and the Second Prototype




List of Figures
1 Single Hashing
2 Greedy[d] Hashing, d = 4
3 Left[d] Hashing, d = 4. The four partitions are visualized using dotted lines
4 A Comparison of the State of the Art and Iceberg hashing
5 The iceberg hashing algorithm
6 Array buckets stores all userspace physical page frames in two levels using
freelists
7 Address translation mechanism. Page table entry uses the first encoding
described (three bits for bucket, five for offset). Bucket size = 32,
level1size = 26, d=7
List of Tables
1 Number of associativity conflicts in 56,304 page allocations as a function of
bucket size b. In this experiment, d = 4, level 2 size = √b.
2 b = 8, level 2 size = 3. Number of associativity conflicts in 56,304 page
allocations as a function of iceberg hashing parameter d.
3 b = 32, level 2 size = 6. Number of associativity conflicts in 56,304 page
allocations as a function of iceberg hashing parameter d.
4 b = 64, level 2 size = 8. Number of associativity conflicts in 56,304 page
allocations as a function of iceberg hashing parameter d.
5 Number of associativity conflicts and early evictions in 54,272 page
allocations. In this experiment, b = 32 and d = 6.
1 Introduction
Virtual memory is ubiquitous in modern computer systems because of the flex-
ibility it affords programmers, but it comes with the cost of translating every
memory access from a virtual to a physical address. This translation can be ex-
pensive, often requiring multiple RAM accesses and spanning hundreds of CPU
cycles. On modern x86 CPUs with virtual-machine support (VT-x or NPT), the
worst-case translation cost can be 24 RAM reads [1]. To mitigate these costs,
CPUs include a Translation Lookaside Buffer (TLB), a small hardware
cache of virtual-to-physical mappings.
A key issue for modern workloads is TLB coverage, i.e., how many virtual
addresses can be cached in the TLB. Not only do modern workloads operate
on increasingly large data sets, but emerging classes of applications, such
as machine learning and graph analytics, are also dominated by pointer chasing
and other access patterns that are difficult to predict in hardware. Typical
TLBs are only able to cover tens of MB of address space. For example, the L1
and L2 TLBs in recent Intel Skylake chips have a combined 1600 entries, each
mapping a single 4kB page [2]. These entries cover only 6.5MB of RAM (see
footnote 1). As a result, many of these applications report 20–30% overhead
attributable to TLB misses [3, 4, 5, 6], and some as high as 83% [7].
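The coverage figure quoted above is simple arithmetic, and a short C sketch makes it easy to re-run for other TLB sizes. The function name tlb_coverage_bytes is hypothetical and not part of any kernel; it only multiplies entry count by page size.

```c
#include <stdint.h>

/* Bytes of address space covered by a TLB whose entries each map one page.
 * Hypothetical helper for illustration only. */
static uint64_t tlb_coverage_bytes(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}
```

With 1600 entries and 4096-byte pages this gives 6,553,600 bytes, which is the roughly 6.5MB stated in the text.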
There are core electrical-engineering reasons why TLBs remain small. The
microarchitectural structures for implementing a fully associative cache (a con-
tent addressable memory, or CAM [8]) face hard physical constraints that force
them to trade size for access latency. These structures also tend to be power-
hungry, consuming 3-13% [9] of a processor’s power. Furthermore, higher asso-
ciativity and a deeper TLB hierarchy also increase dynamic energy usage [9].
The standard solution to TLB scaling limits is to increase the granularity of
translation. Most commonly, this takes the form of increasing the page size
(resulting in so-called huge pages) [10, 11], although other recent work also
revisits segments [12] and other variably-sized, physically contiguous
translations [13]. Increased translation granularity naturally increases the
TLB coverage.
There are several downsides to increasing the granularity of translations,
including increased internal fragmentation and wasted space. Mixed
translation sizes are also problematic for OS developers, who must now maintain
physical contiguity of huge pages while battling external fragmentation [6, 11,
14]. Short of a breakthrough in memory defragmentation, increased translation
granularity pushes difficult problems onto OS developers.
In this thesis, we observe that there is a latent assumption in virtual-memory
design that should be reconsidered: that page tables and translations need to be
fully associative. A fully associative mapping means that any virtual address
can be mapped to any physical address; in contrast, a mapping is said to be
ℓ-way associative if each virtual address has only ℓ possible physical addresses
that it could be mapped to. Colloquially, ℓ-way associative RAM is called
low-associativity when ℓ << total number of physical addresses. Full
associativity is provided in all modern operating systems, TLBs, and page tables.
Footnote 1: Some of these entries may optionally map to 2MB pages. Intel TLBs
also have a small number of additional entries for huge pages, which we discuss
below.
The reason that reducing associativity is relevant to TLB coverage is that, if
virtual-to-physical translations were to have low associativity (rather than full
associativity), then the TLB coverage problem would become much easier. In
particular, low associativity would mean that the TLB could store fewer bits
for each translation; rather than storing the entire physical address, the TLB
could simply store an encoded location within a smaller set of pages. Thus we
can achieve higher TLB coverage by lowering RAM associativity.
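To make the bit-count argument concrete, the sketch below computes how many bits are needed to name one of a given number of candidate locations. The helper bits_needed is hypothetical, and the example values (2^24 physical frames, 256 candidate locations) are illustrative assumptions rather than measurements from the thesis.

```c
#include <stdint.h>

/* Smallest number of bits that can distinguish `choices` alternatives,
 * i.e. ceil(log2(choices)); returns 0 when there is at most one choice. */
static unsigned bits_needed(uint64_t choices) {
    unsigned bits = 0;
    while (bits < 64 && ((uint64_t)1 << bits) < choices)
        bits++;
    return bits;
}
```

Under these assumptions, a fully associative mapping over 2^24 frames needs 24 bits per translation, while a mapping restricted to 256 candidate locations needs only 8.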
This thesis explores the following question: Is a low-associativity RAM vi-
able?
The main drawback of low associativity is the potential for associativity
conflicts, i.e., situations in which all ℓ of the physical addresses where a
page could reside are already taken. Associativity conflicts can cause
additional RAM and disk I/O operations because they force pages to sometimes be
evicted from RAM even when RAM is under capacity.
Designing an ℓ-way associative address-translation scheme is analogous to
designing a hash table in which there are exactly ℓ positions where each element
can reside. This analogy allows us to bring the rich theory of hash tables to bear
on the low-associativity address-translation problem. Associativity conflicts in
the address-translation scheme correspond to unresolvable collisions in the hash
table (i.e., insertions where none of the ℓ positions for a key are free).
In order for hash tables to be useful for low-associativity address translation,
we want the probability of associativity conflicts to be provably very small.
In the algorithms literature, the probability of an unresolvable collision in a
hash table is usually bounded as a function of the overall fullness of the table;
typically there is an inflection point, or fullness threshold, above which
collisions are probable. For many hashing schemes, the fullness threshold is at
or below 50%. An ideal hash table for paging would have a low associativity and
a high fullness threshold, i.e., very low probability of conflict even with most
(over ∼90%) of physical memory allocated (see footnote 2).
This thesis argues that a new hashing scheme, called iceberg hashing,
has all the desirable properties we want in an address-translation scheme. The
algorithm itself and the proofs of its asymptotic properties are under concurrent
submission to a theory conference, with the thesis author as a second author
(see footnote 3). Here we show empirically that iceberg hashing has a high
fullness threshold (over 90%), below which associativity conflicts have
negligible probability. We show how to use iceberg hashing to achieve low
associativity in RAM, resulting in what we call iceberg paging.
Footnote 2: Moreover, we also want our hash table to be stable, meaning that
once an element is inserted it does not get subsequently moved. This rules out,
for example, cuckoo hashing, which in past work [15] has been shown to be a
promising data structure for page-table design (but not necessarily for
determining address-translation mappings).
Footnote 3: Authors: Michael Bender, Abhishek Bhattacharjee, Alex Conway, Martin
Farach-Colton, Rob Johnson, Sudarsun Kannan, William Kuszmaul, Nirjhar Mukherjee,
Don Porter, Guido Tagliavini, Janet Vorobyeva and Evan West
The goal of this thesis is to empirically evaluate the resilience of iceberg
paging to associativity conflicts. Our main empirical result is that, based on a
prototype implementation of iceberg paging in the xv6 operating system, there
is preliminary evidence that associativity conflicts are not problematic. We
show that, when RAM is maintained with a small amount of reserved capacity
(meaning there is ∼5% free space), then associativity conflicts almost never
occur. And even when RAM is maintained at full capacity, associativity conflicts
can almost always be resolved by evicting a page that is among the 5% of oldest
pages.
2 State of the Art
This section explores four popular low-associativity hashed paging schemes and
briefly formalizes the tools we need to analyze them. This helps us propose a
better way to implement low associativity paging, by remedying the pitfalls of
the current state of the art.
To experiment with low associativity paging, we need a mechanism to easily
control the associativity of virtual-to-physical page mappings, which simplifies
both analysis and experiments. This is achieved by abstracting a fixed-size
collection of physical page frames called a bucket. In a hashed page allocation
request, a hash function operates on the virtual address to derive a bucket
number. A free physical page frame from the derived bucket is then used to
service the page allocation request. Thus, the associativity of address
translation can be controlled by changing the number of physical page frames in
a bucket, or in other words, changing the bucket size.
In the theoretical model, bucket sizes are unbounded. However, in reality,
buckets store physical addresses and must have a fixed size. Thus, a collision
happens when no free physical page frames are available in a bucket.
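A minimal sketch of bucketed allocation with fixed-size buckets is shown below. Everything in it is an assumption made for illustration: the tiny sizes NBUCKETS and BUCKET_SIZE, the multiplicative placeholder hash in bucket_of, and the simplification that frames are handed out in order and never freed.

```c
#include <stdint.h>

#define NBUCKETS    4u   /* hypothetical: tiny values for illustration */
#define BUCKET_SIZE 2u   /* physical page frames per bucket */

static unsigned bucket_load[NBUCKETS];   /* allocated frames per bucket */

/* Placeholder hash from virtual page number to bucket (an assumption,
 * not the hash function used in the thesis prototype). */
static unsigned bucket_of(uint64_t vpage) {
    return (unsigned)((vpage * 0x9E3779B97F4A7C15ULL) >> 32) % NBUCKETS;
}

/* Returns a physical frame number, or -1 on a collision (bucket full).
 * Frames are handed out in order; frees are not modelled here. */
static long alloc_frame(uint64_t vpage) {
    unsigned b = bucket_of(vpage);
    if (bucket_load[b] == BUCKET_SIZE)
        return -1;
    return (long)b * BUCKET_SIZE + bucket_load[b]++;
}
```

Because a virtual page can only land in the BUCKET_SIZE frames of its bucket, the associativity of this toy mapping equals the bucket size, and a request fails as soon as that one bucket is full.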
The type of hash scheme used affects the frequency of collisions. This is
because the load of a bucket — defined as the number of allocated physical
page frames in that bucket — is often a function of the number of virtual pages
in the system; and the faster this function grows, the higher the number of
collisions that happen in a fixed time interval will be.
Thus, a system's load factor, defined as the load of the bucket which
contains the most allocated physical page frames, is a good indicator of the
number of collisions to expect in a system. Maximum loads are derived using
balls-and-bins analysis on these techniques, which is beyond the scope of
this paper. They are listed in the subsections below, and their derivations are
available in the referenced papers.
2.1 Single
Single or power-of-1-choice hashing uses a single hash function on an input
to (deterministically) compute an output, as shown in Figure 1. This is the
most ubiquitous method of hashing, often found in userspace implementations
like the C++ std::unordered_map. As seen in the analysis section, single either
needs too many additional physical pages in a bucket (high insertion variance)
or requires too many bits to represent a page mapping.
If P = number of physical pages, n = number of buckets, and h = P/n is the
average load, then the load factor of single hashing is h + O(√(h log(n))).




Figure 1: Single Hashing
2.2 Greedy[d]
Greedy[d] (also known as power-of-d-choice hashing or d-choice hashing) is an
improvement on single and attempts to lower collision probability by using d
hash functions. Each of the d hash functions behaves like a single hash function
and derives one bucket. This is pictured in Figure 2.
Greedy[d] then selects the bucket with the least load (most free physical page
frames) amongst the derived buckets. In case of a tie, a winner is picked at
random.
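The selection rule of greedy[d] can be sketched as follows. The function name greedy_pick and its array-based interface are assumptions for illustration; the candidate buckets are taken as already computed by the d hash functions, and rand() stands in for whatever randomness source a real system would use (its modulo reduction is only approximately uniform).

```c
#include <stdlib.h>

/* Greedy[d]: among the d candidate buckets, return the one with the smallest
 * load; ties are broken at random by keeping each tied bucket with
 * probability 1/ties as the scan proceeds. */
static unsigned greedy_pick(const unsigned *load, const unsigned *cand,
                            unsigned d) {
    unsigned best = cand[0];
    unsigned ties = 1;
    for (unsigned i = 1; i < d; i++) {
        unsigned c = cand[i];
        if (load[c] < load[best]) {
            best = c;
            ties = 1;
        } else if (load[c] == load[best]) {
            ties++;
            if ((unsigned)rand() % ties == 0)
                best = c;
        }
    }
    return best;
}
```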






Figure 2: Greedy[d] Hashing, d = 4
2.3 Left[d]
Left[d], like greedy[d], also uses d hash functions. The range of Left[d] (the
physical address space) is divided into d equal range partitions that are assigned
to the d hash functions. Each hash function derives a bucket that lies in its
assigned range partition.
Left[d] then selects one bucket from these d using the same criteria as
greedy[d]. The only difference is that left[d] breaks ties asymmetrically, always
choosing the leftmost of the tied buckets. An example is provided in Figure 3.
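The asymmetric tie-breaking of left[d] reduces to a strict comparison during a left-to-right scan. In the sketch below, left_pick is a hypothetical name, and the candidates are assumed to be passed in partition order (cand[0] from the leftmost partition).

```c
/* Left[d]: candidate i lies in partition i, ordered left to right.
 * Return the least-loaded candidate; on a tie, keep the leftmost one. */
static unsigned left_pick(const unsigned *load, const unsigned *cand,
                          unsigned d) {
    unsigned best = cand[0];
    for (unsigned i = 1; i < d; i++)
        if (load[cand[i]] < load[best])   /* strict '<' keeps leftmost on ties */
            best = cand[i];
    return best;
}
```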





Here, φd = limk→∞
d
√




2.4 Cuckoo
Cuckoo hashing uses the strengths of greedy[d] (with d = 2) in a slightly
different manner. It uses two different hash functions to find two buckets, and
then picks a final bucket out of them.
In case of a collision in a bucket, one of the used pages of that bucket
is zeroed out and replaced, destroying an existing virtual-to-physical page
mapping. The virtual page from the destroyed mapping now requests a new
physical page allocation. The request is serviced by selecting a new physical
page frame from its other bucket choice, which may trigger another collision,
adding some tail latency.
A load factor analysis does not make sense here, as the limiting factor of
this technique is not the load factor but rather the long tail latencies caused
by collisions.
Figure 3: Left[d] Hashing, d = 4. The four partitions are visualized using dotted
lines
3 Analysis
This section deals with questions regarding the usefulness of the four above
techniques in a real kernel. More specifically, this section formalizes and analyses
properties such as space overhead and TLB entry size.
We first derive a function (in P = the number of pages and n = the number
of buckets) for system load for these hashing schemes. We then express the
function as a sum of average load and variance. This allows us to compare and
observe how the required memory overhead behaves for different bucket sizes.
We also compare the number of bits each hashing scheme, at a particular
bucket size (P/n), would require in the TLB.
This allows us to observe how poorly the existing hashing schemes suit paging,
and to identify the properties we would like to see in an ideal hash function.
Let us first analyze the memory overhead for single hashing. We know that
the load factor is $P/n + O(\sqrt{(P/n)\log(n)})$. So if we choose the number
of buckets as $n = P/\log(P)$, the average load is $h = \log(P)$ and the load
factor is $\log(P) + O(\sqrt{\log(P)\log(n)})$.
In most cases, P >> n, as we have many more pages than buckets. Since
$\log(n) \le \log(P)$, the load factor can also be written as
$\log(P) + O(\log(P))$ = average load + maximum variance. This bound gives a
variance of $O(\log(P))$; hence, we should reserve $O(\log(P))$ extra pages per
bucket, or a system-wide total of $O(\log(P)) \cdot (P/\log(P))$ pages. This
can be simplified to $O(1) \cdot P$ pages. This means that, for any system with
P pages and a big-O order term of, say, 10, we might need to reserve 10P = 1000%
extra memory, which is clearly not ideal.
We can significantly reduce this memory overhead by choosing larger buckets
(and thus fewer buckets), say $n = P/\log^2(P)$. The average load becomes
$h = \log^2(P)$ and the load factor becomes
$\log^2(P) + O(\sqrt{\log^3(P)}) = \log^2(P) + O(\log^{1.5}(P))$ = average load
+ maximum variance. Thus, this gives a variance of $O(\log^{1.5}(P))$. This
means that we should reserve $O(\log^{1.5}(P))$ pages per bucket, or a total of
$O(\log^{1.5}(P)) \cdot P/\log^2(P) = O(1/\sqrt{\log P}) \cdot P$ pages. So, for
a system with 64GB = $2^{24}$ pages (4kB pages) of memory, and a big-O order
term of 10, we would need to reserve
$10 (1/\sqrt{\log P}) P \approx 10 (1/5) P = 2P$ = 200% extra pages of memory,
better than the previous case.
Figure 4: A Comparison of the State of the Art and Iceberg hashing
However, increasing the bucket size is not ideal for real-world usage: if we
choose large buckets, their offset representation needs more bits, in this
case $\lceil 2\log\log(2^{24}) \rceil = 10$ bits. This is not terrible, but the
other schemes, like greedy, left, cuckoo and iceberg, do much better (on the
order of $\log\log\log P$, as seen in the next few paragraphs).
Thus, single hashing either has high memory overheads or requires too many
bits to represent a mapping in the TLB. This can be seen in column two of
Figure 4.
If we compute the memory overhead for greedy[d] and left[d], we observe that
they, at first glance, seem like a viable option. For $n = P/\log^2\log(P)$
buckets, a space overhead of just $\delta = O(1/\log\log(P))$ can be derived for
both techniques. This means that for a system with 64GB = $2^{24}$ pages (4kB)
of memory and a big-O order term of 10, we would need to reserve just
$10 (1/\log\log(2^{24})) P \approx 2P$ = 200% extra pages of memory. (If we
change the bucket size from $\log^2\log(P)$ to $\log^3\log(P)$, the memory
overhead falls to $10/\log^2\log(P) \approx 10/25$ = 40% of memory.) If we
analyze the number of bits required to represent bucket offsets for greedy[d]
and left[d] in the TLB, we would require $\lceil 2\log\log\log(P) \rceil = 5$
bits. (For the large bucket size mentioned above, we would require
$\lceil 3\log\log\log(P) \rceil = 7$ bits.)
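As a check on the bit counts quoted in this section, the quantities for P = 2^24 can be evaluated directly. The worked numbers below assume base-2 logarithms, which the text does not state explicitly; the overhead percentages in the text additionally round log log P up to 5.

```latex
\begin{align*}
\log P &= \log_2 2^{24} = 24, \\
\log\log P &= \log_2 24 \approx 4.58, \\
\log\log\log P &\approx \log_2 4.58 \approx 2.20, \\
\text{single, bucket size } \log^2 P: \quad
  & \lceil 2\log\log P \rceil = \lceil 9.17 \rceil = 10 \text{ bits}, \\
\text{greedy/left, bucket size } \log^2\log P: \quad
  & \lceil 2\log\log\log P \rceil = \lceil 4.39 \rceil = 5 \text{ bits}, \\
\text{greedy/left, bucket size } \log^3\log P: \quad
  & \lceil 3\log\log\log P \rceil = \lceil 6.59 \rceil = 7 \text{ bits}.
\end{align*}
```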
However, these seemingly amazing bounds for greedy[d] and left[d] hashing,
on both memory overhead and required TLB representation size, have a fatal
flaw: the known bounds on load factor do not hold in the presence of deletions
(see footnote 4). In a paging system, we will be constantly inserting, removing,
and reinserting pages into buckets. Hence, the bounds mentioned above need to
hold under dynamic paging conditions. This shortfall is marked in column
three of Figure 4.
Finally, cuckoo hashing, while seemingly ideal, has some hidden pitfalls.
Cuckoo hashing can reduce buckets to a very small size while maintaining
low memory overheads, but requires kicking pages from one bucket to another,
which incurs high memory copy costs. It would also make it difficult to do page
allocation concurrently, and would cause long tail latencies in page allocation
times. Rather, we want a stable hashing scheme, that is, a scheme where pages do
not get moved after their initial placement. This is represented by columns one
and four of Figure 4.
Thus, we want iceberg hashing to have all of the desirable properties
mentioned above: stability, low memory overhead, and low TLB representation
overhead, even in the dynamic case, while avoiding tail latencies.
4 Iceberg Hashing
We now describe a new hashing scheme, iceberg hashing, that is stable, and
enables us to use very small buckets. This allows us to reduce TLB entries to
a handful of bits per page whilst maintaining a low memory overhead, even in
the dynamic case.
In the interest of self-containment, iceberg hashing is described here. However,
iceberg hashing is not a contribution of this paper (rather, its implementation
and the empirical results derived from it are). As previously mentioned, a
concurrent submission to a theory conference, with the thesis author as a second
author, includes a complete presentation of a more sophisticated variant of
iceberg hashing and proofs of its key properties (see footnote 5). The
implementation and empirical results presented in this thesis are also under
concurrent submission to a systems conference, with the thesis author as first
author (see footnote 6).
4.1 The General Case
Iceberg hashing combines single hashing (for most mappings) and left[d] hashing
(for a few mappings, to limit the occupancy of any bucket). This is analogous
to how most of an iceberg is hidden underwater.
Iceberg[d]: Let h be the average bin occupancy. Set δ ≈ √((log h)/h). Let
τ = (1 + δ/2)h.
1. We select a bucket with Single. If there are fewer than τ pages at
level-1 in that bucket, the page is placed there at level-1.
2. Else, if there are less than O(n) level-2 pages in the system, then the
page selects a new bucket using LEFT[d], where we consider only the
level-2 pages when deciding which bucket is emptiest.
3. Else, the page selects its original bucket (the choice of Single) and is
labeled level 3.
Figure 5: The iceberg hashing algorithm
Footnote 4: Weaker bounds have been shown in this case [18], but they cannot be
used to guarantee low memory overhead.
Footnote 5: Authors: Michael Bender, Abhishek Bhattacharjee, Alex Conway, Martin
Farach-Colton, Rob Johnson, Sudarsun Kannan, William Kuszmaul, Nirjhar Mukherjee,
Don Porter, Guido Tagliavini, Janet Vorobyeva and Evan West
Footnote 6: Authors: Nirjhar Mukherjee, William Kuszmaul, Guido Tagliavini, Evan
West, Michael Bender, Alex Conway, Martin Farach-Colton, Rob Johnson, Sudarsun
Kannan, Don Porter, Jun Yuan
A bucket in iceberg hashing has size ≈ log² log P. Pages in each bucket are
marked as being in one of three levels.
Level-1 in a bucket has maximum capacity τ. Level-2 in iceberg hashing
has a maximum capacity of φ = O(n) for the whole system (i.e., the total number
of level-2 pages in all the buckets combined must be less than or equal to φ).
Level-3 in a bucket has unbounded capacity.
The general iceberg hashing algorithm proceeds as follows:
1. We first try to allocate a page to a bucket using single hashing. If this
succeeds, then the page is added to the bucket, marked as a level-1 page.
2. If this fails, due to level-1 of the bucket being full, we try to allocate the
page to a bucket using Left[d] hashing, where we consider only the level-2
pages when deciding which bucket is emptiest. If this succeeds, then the
page is marked as a level-2 page.
3. If both our level-1 and level-2 allocation attempts fail, then we place the
page in the bucket chosen by single for level-1 insertion and mark the page
as level-3.
Note that all d + 1 hash functions are independent. The general iceberg
hashing algorithm is summarized for convenience in Figure 5.
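The three steps above can be sketched in C as follows. This is a toy model under stated assumptions, not the thesis prototype: the constants NB, TAU and PHI, the struct ib_bucket, and the function iceberg_place are all hypothetical, and the buckets chosen by the d + 1 hash functions are passed in rather than computed.

```c
#define NB   8u   /* hypothetical number of buckets */
#define TAU  3u   /* hypothetical level-1 capacity per bucket (tau) */
#define PHI  4u   /* hypothetical system-wide level-2 capacity (phi) */

struct ib_bucket { unsigned l1, l2, l3; };   /* pages held at each level */
static struct ib_bucket tbl[NB];
static unsigned level2_total;                /* level-2 pages in all buckets */

/* Places one page. `first` is the bucket chosen by single hashing and
 * cand[0..d-1] are the Left[d] candidates, ordered left to right.
 * Writes the chosen bucket to *out and returns the level (1, 2 or 3). */
static int iceberg_place(unsigned first, const unsigned *cand, unsigned d,
                         unsigned *out) {
    if (tbl[first].l1 < TAU) {               /* step 1: level 1 */
        tbl[first].l1++;
        *out = first;
        return 1;
    }
    if (level2_total < PHI) {                /* step 2: level 2 via Left[d] */
        unsigned best = cand[0];
        for (unsigned i = 1; i < d; i++)
            if (tbl[cand[i]].l2 < tbl[best].l2)   /* only level-2 load counts */
                best = cand[i];
        tbl[best].l2++;
        level2_total++;
        *out = best;
        return 2;
    }
    tbl[first].l3++;                         /* step 3: level 3, original bucket */
    *out = first;
    return 3;
}
```

Repeatedly placing pages whose single-hash bucket is the same first fills level 1 of that bucket, then spreads pages over the Left[d] candidates until the level-2 budget is spent, and only then falls back to level 3.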
Thus any page can be allocated into one of d + 1 buckets, and can be in any
of ≈ log² log P positions in each of those buckets. If d = O(1), then this means
that the corresponding hashed paging scheme is ℓ-associative for ℓ ≈ log² log P.
This, in turn, means that TLB entries need only log ℓ ≈ log log log P bits per
page.
And, amazingly, iceberg hashing ensures that buckets will almost never over-
flow.
Theorem 1 Consider an iceberg hashing scheme with n buckets, in which we
run an infinite process of inserting and deleting pages. Then, with high
probability at every moment in time, the fullest bucket has
h + O(log log n) + o(h) + O(1) pages, where h is the average number of pages per
bucket.
This means we get the smoothness guarantees of left[d] hashing at all moments
in time, enabling us to have low memory overhead and low associativity, even
under mapping deletions and page re-insertions.
5 Implementation Details
This section explains the design and prototype implementation of iceberg paging
in the Xv6 OS and Linux kernel. We omit some key features of memory
management that would be needed for a production-quality kernel.
One key fact to keep in mind is that while buckets in the theoretical
derivations above are unbounded, that is not the case in a real system. Further,
it is not feasible to reserve level-3 pages for allocation given the extremely
low probability of the event, as otherwise precious memory would be left unused
and wasted. Thus, in a real system, buckets are bounded and level-3 does not
exist. If a page would have to be placed at level-3, the insertion is treated as
an associativity conflict (and gets handled by the LRU).
5.1 Boot
On startup, the bootloader initializes the CPU and other components, loads
the kernel into memory, sets up some kernel memory mappings, and hands
over execution to the kernel. Thus, to implement iceberg hashing for kernel
memory, we would need to edit the bootloader to be iceberg-aware, a task with
diminishing returns. Our strategy, then, is to keep kernel memory working the
way it does, i.e., regular forward-mapped paging, and implement iceberg hashing
only for userspace memory, effectively allowing us to ignore the existence of
the bootloader.
5.2 Free Page List
At startup, before anything else, the kernel must set up the free physical
memory list, or freelist, so that on-demand allocation of memory to both the
kernel and the user is possible.
The kernel maintains a singly-linked list, whose nodes are inlined into the
first few bytes of a free physical page. This allows maintenance of free physical
pages in an easily accessible manner (both for allocation and free). The list
node only has a next pointer pointing to the next element of the list, and by
construction, the next free page. The address of the first node/page is stored in
a global variable, freelist.
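The inlined free list described above can be sketched in a few lines of C. The struct run and the freelist variable follow the names used in the text; the function names page_free and page_alloc are hypothetical stand-ins, and locking and page zeroing are left out of this sketch.

```c
#include <stddef.h>

#define PGSIZE 4096

/* A free page stores the list node in its own first bytes. */
struct run { struct run *next; };

static struct run *freelist;   /* head: address of the first free page */

/* Push a page onto the free list. */
static void page_free(void *page) {
    struct run *r = (struct run *)page;
    r->next = freelist;
    freelist = r;
}

/* Pop a free page, or return NULL when none is available. */
static void *page_alloc(void) {
    struct run *r = freelist;
    if (r)
        freelist = r->next;
    return r;
}
```

Because the node lives inside the free page itself, the list needs no memory beyond the single head pointer.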
For iceberg hashing, the idea of buckets needs to be built into the management
of physical pages. Thus, instead of a global freelist pointer, we maintain a
global array, buckets. Each element of buckets has a load counter for each of
its two levels, level1utilization and level2utilization, and two freelist
pointers, freelistlevel1 and freelistlevel2, maintaining freelists of the pages
in that level in the bucket.
struct bucket {
    short int level1utilization;   // number of pages used in level 1
    short int level2utilization;   // number of pages used in level 2
    struct run *freelistlevel1;
    struct run *freelistlevel2;
} buckets[B];   // B is number of buckets, hardcoded
Figure 6: Array buckets stores all userspace physical page frames in two levels
using freelists
On startup, each page at or after the 4MB boundary (all pages below that
are kernel pages and are listed in the global freelist structure) is inserted
into buckets. A page with address (4 + x)MB is first normalized to xMB, and then
assigned to bucket ⌊(x/pagesize)/b⌋ (b is the size of one bucket). The page is
then assigned to the freelist of its corresponding level, calculated using modulo
arithmetic. Finally, level1utilization and level2utilization are both initialized
to 0.
For instance, for b = 32, if we assign 26 level-1 pages to a bucket, then
the physical page with address 136kB after 4MB will be normalized to 136kB
and assigned to bucket ⌊(136kB/4kB)/32⌋ = 1, and into freelistlevel1, as
((136kB/4kB) % 32) = 2 < 26.
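The bucket and level computation from this example can be written out as a small C function. The names slot_of, struct slot, USER_BASE and LEVEL1_SIZE are hypothetical; the rule that offsets of 26 and above belong to level 2 is inferred from the example rather than stated in the text.

```c
#include <stdint.h>

#define PGSIZE      4096u
#define USER_BASE   (4u << 20)   /* the 4MB boundary */
#define BUCKET_SIZE 32u          /* b */
#define LEVEL1_SIZE 26u          /* level-1 pages per bucket */

struct slot { uint32_t bucket; uint32_t offset; int level; };

/* Map a userspace physical address to its bucket, offset and level. */
static struct slot slot_of(uint32_t paddr) {
    uint32_t page = (paddr - USER_BASE) / PGSIZE;   /* normalize, then index */
    struct slot s;
    s.bucket = page / BUCKET_SIZE;
    s.offset = page % BUCKET_SIZE;
    s.level  = (s.offset < LEVEL1_SIZE) ? 1 : 2;    /* assumed level rule */
    return s;
}
```

For the address 136kB past the 4MB boundary this yields bucket 1, offset 2 and level 1, matching the worked example.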
5.3 Page Allocation
A page allocation request may be initiated from multiple places in the kernel.
However, each call must now be manually routed to one of two functions: kalloc()
for kernel space allocations, and ihalloc() for userspace allocations.
kalloc() allocates and returns a physical page frame from the freelist, as
long as one is available. Similarly, ihalloc() returns a userspace physical page
frame, but is much more complex and is discussed in the next section. The page
returned by these functions is then used to service the page allocation request.
5.3.1 Userspace Page Allocation
The userspace allocation function, ihalloc(), takes in two parameters, a virtual
address and a pid, and returns a physical page frame according to the iceberg
hashing scheme.
First, as for all hashing, a key is generated. The key, k is just a string
concatenation: < pid, virtualaddress >. This key, by construction, is always
unique for any virtual page of any arbitrary process.
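One way to realize the key and a seeded hash is sketched below. Both functions are assumptions for illustration: the text describes the key as a concatenation of pid and virtual address, which is modelled here by packing them into 64 bits (assuming a 16-bit pid and a 48-bit virtual page number), and hash_bucket uses a splitmix64-style mixing step as a stand-in for the prototype's actual hash function, which the text does not specify.

```c
#include <stdint.h>

/* Pack <pid, virtual page number> into one 64-bit key. Assumes the pid fits
 * in 16 bits and the virtual page number in 48 bits (illustrative limits). */
static uint64_t make_key(uint16_t pid, uint64_t vaddr) {
    return ((uint64_t)pid << 48) | ((vaddr >> 12) & 0xFFFFFFFFFFFFULL);
}

/* Stand-in for h(k, s): mix key and seed, then reduce the result to a
 * bucket number in [0, nbuckets). */
static uint32_t hash_bucket(uint64_t key, uint64_t seed, uint32_t nbuckets) {
    uint64_t x = key ^ seed;
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return (uint32_t)(x % nbuckets);
}
```

Using different seeds with the same key gives the d + 1 bucket choices without needing d + 1 separately written hash functions.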
This key is now used in conjunction with a predefined random global integer
seed, s_1, to generate a bucket number using a hash function h(k, s_1). This
points to a bucket that services the level-1 allocation of this ongoing request.
If the freelistlevel1 structure is empty (i.e., level1utilization == level-1
size), level-1 insertion has failed and we need to attempt a level-2
insertion.
To allocate a physical page frame from level-2, we need d candidate buckets.
These bucket numbers are computed using predefined global integer seeds s_2
through s_{d+1} via hash function calls h2(k, s_2, d) through h2(k, s_{d+1}, d).
Note that the i-th h2 call does not return a bucket number from the whole
physical page frame range P, but rather from its own partition,
[(i-1)(P/d), i(P/d)), in line with the rules of left[d]. We then choose the final
bucket by selecting the candidate bucket with the lowest level2utilization value.
Ties are broken in ascending order, i.e., in favor of the leftmost partition. The
chosen bucket services the allocation using a page from its freelistlevel2
structure.
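The partition restriction on the i-th level-2 hash call can be sketched as below. The function partition_bucket is hypothetical; it takes an already mixed hash value and works over bucket numbers rather than raw page frame numbers, assuming that d divides the number of buckets evenly.

```c
#include <stdint.h>

/* i-th Left[d] candidate (i = 1..d): map a pre-mixed hash value into the
 * i-th of d equal partitions of the bucket range [0, nbuckets).
 * Assumes d divides nbuckets. */
static uint32_t partition_bucket(uint64_t hashval, uint32_t i, uint32_t d,
                                 uint32_t nbuckets) {
    uint32_t width = nbuckets / d;             /* buckets per partition */
    return (i - 1) * width + (uint32_t)(hashval % width);
}
```

With 64 buckets and d = 4, the third candidate always falls in buckets 32 through 47, so scanning candidates in order of i visits partitions from left to right.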
In case this list is also empty, we have an associativity conflict and enter
allocation failure mode. For now, these conflicts are counted and serviced by a
small portion of reserved memory (that cannot be utilized by the kernel or user
allocation functions directly). Section 6 details how these conflicts are resolved
in the iceberg scheme.
Finally, before the ihalloc() function returns the new physical page frame, it
increments the correct levelutilization variable (of the correct bucket) by one.
5.4 Page Free
Page free, like page allocation, is also divided into two functions, kfree() and
ihfree(). Both functions perform the same action: zeroing out the page (with
memset), followed by inserting it into an available page data structure.
In the case of kfree, the page gets inserted into the global freelist. In the case
of ihfree, the page, like during initialization, is inserted into its corresponding
bucket and, by extension, freelistlevel1 or freelistlevel2. The corresponding level
utilization variable is also decremented by 1.
6 Page Tables
This section discusses how to encode a page translation mapping, how to create
the encoding, how to look up a physical page frame from the encoding, and how
to store the encoded mapping in the page table and TLB.
One of the main strengths of iceberg hashing is that we can use very few
bits to represent a page mapping. A page mapping needs to have just enough
information to know where a virtual page’s corresponding physical page frame
is located. Thus, all we need are the following:
• Hash Seed: The seed determines which hash function to use and, through it,
the correct bucket. It also tells us which level's freelist the physical
page frame lives in. So if seed-3 is used, we know we should use h2() and
look in freelistlevel2, while seed-1 would require us to use h() and look
in freelistlevel1.
• Bucket Offset: Once we have a bucket (and freelist), all we need is an
offset into the bucket (into the correct freelist). So for a bucket size of,
say, 32, the offset would be 0 through 25 (for level-1) or 0 through 5 (for
level-2).
When a page is allocated and is to be recorded in a page table, our ihalloc()
function can trivially return the number of the seed used (and thus the bucket
number). From the seed and the physical address of the returned page, we can
then calculate the offset of the physical page frame within the bucket.
Since what we want is a minimal fixed-length encoding to store the above
information, we can use a scheme as follows: the first three bits of a byte
represent the seed, and the remaining five bits represent the offset. This is
pictured in Figure 7.
Figure 7: Address translation mechanism. Page table entry uses the first encod-
ing described (three bits for bucket, five for offset). Bucket size = 32, level1size
= 26, d=7
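The one-byte layout above can be sketched with a few bit operations. Two
details here are assumptions rather than part of the description: "first three
bits" is taken to mean the three high-order bits, and seed numbers 1 through 8
are stored as 0 through 7 so that d = 7 (eight seeds) fits in three bits. The
function names are hypothetical.

```c
#include <stdint.h>

/* Pack seed (1..8) and offset (0..31) into one byte: sssooooo. */
uint8_t pte_encode(unsigned seed, unsigned offset)
{
    return (uint8_t)((((seed - 1u) & 0x7u) << 5) | (offset & 0x1Fu));
}

/* Recover the seed number (1..8) from an encoded entry. */
unsigned pte_seed(uint8_t entry)
{
    return (unsigned)(entry >> 5) + 1u;
}

/* Recover the offset within the bucket's freelist (0..31). */
unsigned pte_offset(uint8_t entry)
{
    return entry & 0x1Fu;
}
```

For example, seed 3 with offset 5 encodes to the byte 0x45, and decoding that
byte yields seed 3 and offset 5 again.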
Other minimal fixed-length encodings are also viable but may require more
latency to decode in hardware after a TLB read. A different viable encoding
that we have discovered simply unrolls the possible states into a single integer
value. So for a bucket size of 32, level-1 size of 26, and d = 4 (so five seeds)
we can represent the page using an int L. Thus, we can create and look up the
encoding and physical page frame, to and from L, as follows:
L ∈ [0, 25] ∩ Z ⇐⇒ Seed = 1, Offset = L − 0
L ∈ [26, 31] ∩ Z ⇐⇒ Seed = 2, Offset = L − 26
L ∈ [32, 37] ∩ Z ⇐⇒ Seed = 3, Offset = L − 32
L ∈ [38, 43] ∩ Z ⇐⇒ Seed = 4, Offset = L − 38
L ∈ [44, 49] ∩ Z ⇐⇒ Seed = 5, Offset = L − 44
This means that L takes one of 50 values, so we only need ⌈log2(50)⌉ = 6 bits
to store this information!
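The unrolled encoding can be sketched as a pair of C functions. With a level-1
size of 26, a level-2 size of 6, and d = 4, the five seeds cover
26 + 4·6 = 50 values (0 through 49). The function names and the -1 error
convention are hypothetical.

```c
#define L1_SIZE 26  /* level-1 frames per bucket */
#define L2_SIZE 6   /* level-2 frames per bucket */
#define D       4   /* level-2 seeds are numbered 2..D+1 */

/* Map (seed, offset) to a single integer L, or -1 if out of range. */
int unrolled_encode(int seed, int offset)
{
    if (seed == 1)
        return (offset >= 0 && offset < L1_SIZE) ? offset : -1;
    if (seed < 2 || seed > D + 1 || offset < 0 || offset >= L2_SIZE)
        return -1;
    return L1_SIZE + (seed - 2) * L2_SIZE + offset;
}

/* Recover (seed, offset) from L; returns 0 on success, -1 if L is invalid. */
int unrolled_decode(int L, int *seed, int *offset)
{
    if (L < 0 || L >= L1_SIZE + D * L2_SIZE)
        return -1;
    if (L < L1_SIZE) {
        *seed = 1;
        *offset = L;
    } else {
        *seed = 2 + (L - L1_SIZE) / L2_SIZE;
        *offset = (L - L1_SIZE) % L2_SIZE;
    }
    return 0;
}
```

Decoding needs a comparison, a division, and a remainder, which illustrates
why this denser encoding may cost more to decode in hardware than the
fixed bit-field layout.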
In the case of either encoding, one byte is more than enough for most realistic
systems. These one byte entries are stored as page table entries instead of the
traditional full physical address. When this byte is brought into a TLB, we
effectively increase TLB coverage as we save 7 bytes per entry (for a physical
address of length 8 bytes).
7 Prototype and Experiments
Now that we have described the theoretical benefits, design, and implementation
of iceberg hashing, this section evaluates these claims experimentally and
looks for any discrepancies.
7.1 Xv6 Prototype and States
The first question we try to answer is: are associativity conflicts as rare as
we predict? To answer this, we implemented iceberg hashing in the Xv6 kernel.
Xv6 is well suited to prototyping because it is a simple and efficient
operating system without too many ornate features. As a result, it is
straightforward to implement iceberg hashing in its kernel.
We chose to implement only the allocation side of the design and ignore the
page table and TLB side. This is because we do not require it to empirically
evaluate the resilience of iceberg hashed paging to associativity conflicts, and
also because to test our TLB design we would require designing a custom TLB
in some simulation tool. So instead, the page tables simply store the physical
address of our allocated physical page frame, the same as in the forward mapped
paging scheme. But since all allocations obey the iceberg hashing rules, our
prototype would be able to capture any associativity conflicts that a full iceberg
paging system would experience.
We divide the system state into two parts: the initial state and the steady
state.
7.2 Initial State
The initial state is when the system starts up and all (userspace) physical mem-
ory is unallocated.
7.2.1 Experiments, Tuning, and Results
Reducing the bucket size reduces the number of bits required to represent a
physical page frame, but increases the probability of associativity conflicts.
Thus, it is very important to find the appropriate bucket size for real-world
use.
To answer this question, we designed a micro-benchmark that measures the number
of associativity conflicts for various bucket sizes. We then manually chose a
bucket size that keeps both associativity conflicts and encoding length low.
The C benchmark allocates a page, writes some random data, and repeats. The
allocated pages are never freed and continue to exist in the system. The
experiment stops when no more free memory is available in the system, as that
marks the end of the initial state. The results of the experiment are detailed
in Table 1.
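The allocation loop of such a benchmark might look like the sketch below. It is
written against a hosted C library rather than the Xv6 user library, and the
max_pages cap and the name fill_pages are hypothetical additions so the loop
can be run safely on a development machine. The conflict counting itself
happens inside the kernel allocator and is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

#define PGSIZE 4096

/* Allocate pages one at a time, dirty each with pseudo-random bytes, and
 * never free them. Stops when allocation fails or max_pages is reached.
 * Returns the number of pages allocated. */
size_t fill_pages(size_t max_pages)
{
    size_t n = 0;
    while (n < max_pages) {
        unsigned char *p = malloc(PGSIZE);
        if (p == NULL)
            break;                     /* out of memory: end of initial state */
        for (size_t i = 0; i < PGSIZE; i++)
            p[i] = (unsigned char)rand();  /* touch every byte of the page */
        n++;                           /* the page is intentionally leaked */
    }
    return n;
}
```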
Associativity Conflicts for Fraction of Memory Allocated
b      85%   90%         95%           100%
8      0     4 (0.01%)   266 (0.47%)   2723 (4.84%)
16     0     0 (0.00%)   35 (0.06%)    1720 (3.05%)
32     0     0 (0.00%)   0 (0.00%)     867 (1.54%)
64     0     0 (0.00%)   0 (0.00%)     639 (1.13%)
128    0     0 (0.00%)   0 (0.00%)     391 (0.69%)
Table 1: Number of associativity conflicts in 56,304 page allocations as a
function of bucket size b. In this experiment, d = 4 and level 2 size = √b.
As we observe, a bucket size of 8 is very good for compressing the number of
bits required to represent a mapping in the TLB, but suffers from associativity
conflicts even before 90% of memory is filled. On the other hand, increasing
bucket size to 128 is great for reducing associativity conflicts but requires too
many bits to represent a mapping. However, a bucket size of 32 hits a sweet
spot for both of these issues.
Next, we wished to see the effects of varying the iceberg parameter d for our
empirically chosen bucket size of 32. We used the same benchmark and varied
d, and just for observation, we also ran the experiment for b = 8 and b = 64.
The results are shown in Tables 2, 3, and 4.
Associativity Conflicts for Fraction of Memory Allocated
d    85%           90%           95%           100%
2    138 (0.25%)   526 (0.93%)   995 (1.77%)   1464 (2.60%)
3    1 (0.00%)     75 (0.13%)    560 (0.99%)   1123 (1.99%)
4    0 (0.00%)     4 (0.01%)     266 (0.47%)   2723 (4.84%)
5    0 (0.00%)     0 (0.00%)     155 (0.28%)   2288 (4.06%)
6    0 (0.00%)     0 (0.00%)     90 (0.16%)    1965 (3.49%)
Table 2: b = 8, level 2 size = 3. Number of associativity conflicts in 56,304
page allocations as a function of iceberg hashing parameter d.
Associativity Conflicts for Fraction of Memory Allocated
d    85%         90%         95%          100%
2    0 (0.00%)   0 (0.00%)   32 (0.06%)   776 (1.38%)
3    0 (0.00%)   0 (0.00%)   0 (0.00%)    1071 (1.90%)
4    0 (0.00%)   0 (0.00%)   0 (0.00%)    867 (1.54%)
5    0 (0.00%)   0 (0.00%)   0 (0.00%)    625 (1.11%)
6    0 (0.00%)   0 (0.00%)   0 (0.00%)    831 (1.48%)
Table 3: b = 32, level 2 size = 6. Number of associativity conflicts in 56,304
page allocations as a function of iceberg hashing parameter d.
Associativity Conflicts for Fraction of Memory Allocated
d    85%         90%         95%         100%
2    0 (0.00%)   0 (0.00%)   0 (0.00%)   819 (1.45%)
3    0 (0.00%)   0 (0.00%)   0 (0.00%)   677 (1.20%)
4    0 (0.00%)   0 (0.00%)   0 (0.00%)   639 (1.13%)
5    0 (0.00%)   0 (0.00%)   0 (0.00%)   624 (1.11%)
6    0 (0.00%)   0 (0.00%)   0 (0.00%)   531 (0.94%)
Table 4: b = 64, level 2 size = 8. Number of associativity conflicts in 56,304
page allocations as a function of iceberg hashing parameter d.
As we can see, the general trend is that the number of associativity conflicts
falls as d increases (albeit with some variance). This is because associativity
increases with d (higher d means a virtual page has more possible mapping
locations). Thus, on one hand we want to minimize d to keep the number of bits
representing a mapping small. On the other hand, making d too small increases
the number of associativity conflicts, forcing us, as with b, to find a sweet
spot.
7.2.2 The Early Eviction Problem and 5% LRU
However, associativity conflicts, even if very low, pose a problem. Due to asso-
ciativity conflicts, we have to prematurely evict pages to insert new ones, even
in the initial state. We hypothesize that an LRU, even a simple one, would be
able to solve this issue.
Xv6 does not support swapping and has no LRU. We implemented an approximation
of LRU and a sanity check as follows. Rather than write pages to disk, we
reserved an 8MB portion of additional physical memory to act as a small
ramdisk. We track the last timestamp of each page. If a bucket has a collision,
we do a local replacement of the oldest page in the bucket and check whether
that page is among the oldest 5% of pages globally. If the page being evicted
is not among the oldest 5% of pages system-wide, then we count that as an early
eviction. We never observed an early eviction during our experiments, as
Table 5 shows.
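The local replacement and the early-eviction check can be sketched as below.
This assumes a flat array of per-page timestamps in which smaller means older,
and it treats a page as "among the oldest 5%" when fewer than 5% of all pages
are strictly older than it; both the data layout and that tie-handling rule
are assumptions, as are the function names.

```c
#include <stddef.h>
#include <stdint.h>

#define OLDEST_PCT 5  /* the "oldest 5% of pages" threshold */

/* Returns the page index of the oldest page among the m pages in a bucket.
 * members[] holds page indices into ts[]; m must be at least 1. */
size_t oldest_in_bucket(const uint64_t *ts, const size_t *members, size_t m)
{
    size_t best = members[0];
    for (size_t i = 1; i < m; i++)
        if (ts[members[i]] < ts[best])
            best = members[i];
    return best;
}

/* Returns 1 if evicting page `victim` counts as an early eviction, i.e.
 * at least OLDEST_PCT percent of all n pages are strictly older than it. */
int is_early_eviction(const uint64_t *ts, size_t n, size_t victim)
{
    size_t older = 0;
    for (size_t i = 0; i < n; i++)
        if (ts[i] < ts[victim])
            older++;
    return older * 100 >= n * OLDEST_PCT;
}
```

A linear scan is enough for a sanity check of this kind; a production LRU
would maintain the global ordering incrementally instead.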
                          Fraction of memory allocated
                          90%   95%   100%
Associativity conflicts   0     0     631 (1.16%)
Early evictions           0     0     0 (0.00%)
Table 5: Number of associativity conflicts and early evictions in 54,272 page
allocations. In this experiment, b = 32 and d = 6.
After conducting all of the above experiments, we concluded (empirically) that
a bucket size of 32 and an iceberg parameter of d ≥ 4 are ideal for most
future experiments.
7.3 Steady State
The steady state of the system is the state at any time after physical memory
is fully (or, with a 5% LRU, 95%) mapped. As a result, any new insertion
necessitates a swap (unless a page in the correct bucket freelist is freed by
the user). This alone should not cause major performance problems, since any
fully associative paging scheme also starts swapping once memory is ≥95% full.
The real concerns are instead whether iceberg hashing causes additional
swapping, and whether the LRU needs to be iceberg aware. This brings up the
question of what LRU scheme to use.
7.3.1 LRU Designs
Motivated by the problem faced by the steady state of a system, we have de-
signed a few LRU schemes that are intuitive to use with iceberg hashing. (Note
that none of these LRU schemes have a proven bound for resolving associativity
conflicts and this is the current topic of interest in the iceberg hashing research
group.)
The main LRU design being considered is called Horizon LRU and works
as follows:
When the system initially enters a steady state, a horizon condition at t = 5
is enforced by the LRU: Any pages below the horizon — in the oldest t% of
system memory — are evicted.
If an associativity conflict arises during the allocation of virtual page v,
the oldest page p in the possible mapping range of v is evicted. If p is in
the oldest t′% of pages, we then update the horizon condition to t = t′.
The mapping range of a virtual address is defined as all the possible locations
it can be mapped to:
(all physical pages assigned to freelistlevel1 in the bucket returned by h())
∪ (all physical pages assigned to freelistlevel2 in all d buckets returned by
h2())
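The conflict-time step of Horizon LRU can be sketched as follows. The sketch
assumes per-page timestamps where smaller means older, takes the mapping range
as a precomputed array of page indices, and interprets "p is in the oldest
t′%" as t′ = 100 · (rank of p by age) / (number of pages). That interpretation,
the struct, and the function name are assumptions, since no bound or exact
definition has been fixed for this scheme.

```c
#include <stddef.h>
#include <stdint.h>

struct horizon_lru {
    double t;  /* current horizon, as a percentage of pages; starts at 5 */
};

/* On an associativity conflict: pick the oldest page in the mapping range,
 * move the horizon to that page's age percentile, and return the victim.
 * range[] holds m >= 1 page indices into ts[], which covers all npages. */
size_t horizon_on_conflict(struct horizon_lru *lru, const uint64_t *ts,
                           size_t npages, const size_t *range, size_t m)
{
    size_t victim = range[0];
    for (size_t i = 1; i < m; i++)
        if (ts[range[i]] < ts[victim])
            victim = range[i];

    /* Rank of the victim by age: 1 + number of strictly older pages. */
    size_t older = 0;
    for (size_t i = 0; i < npages; i++)
        if (ts[i] < ts[victim])
            older++;

    lru->t = 100.0 * (double)(older + 1) / (double)npages;  /* t = t' */
    return victim;
}
```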
Another possible modification to this scheme is to mark all pages below the
horizon as ghost pages. Ghost pages remain invisible to iceberg hashing and
are lazily evicted: which means they are evicted only when their physical page
frame is demanded by iceberg hashing during a page allocation.
8 Current Research and Related Work
This section discusses the currently ongoing research on iceberg hashing as well
as all related work in this field.
8.1 Linux and the Second Prototype
A Linux 5.11.6 version of the iceberg hashing allocation scheme along with a
horizon LRU implementation is currently under development. Data from the
Linux prototype is not yet available and will be inserted into this thesis when
available.
8.2 Related Work
In the early 1990s, Jerry Huck and Jim Hays introduced the hashed page table
and showed that it outperformed the traditional combination of a TLB and
radix-tree page tables by splitting the work between hardware and software
[19]. This supports our idea of decoding the physical page address after
fetching a minimal encoding from the TLB or page table.
Idan Yaniv and Dan Tsafrir show that it is viable to use hashed paging
and hashed page tables over traditional radix trees in both bare-metal and
virtualized setups [20]. This shows that hash-based paging can also help
speed up virtual machines.
Thomas Barr, Alan Cox, and Scott Rixner showed that using the TLB to store
entries from higher levels of a radix tree page table allows skipping multiple
levels of the page table, effectively speeding up address translation [21].
This effectively decreases TLB associativity, as each such TLB entry now
decodes to a page table sub-tree. This is strong evidence that lowering
associativity is a viable option for achieving high-speed address translation.
9 Conclusion
This paper makes the case that low-associativity paging is viable using iceberg
hashing. Iceberg hashing was designed through a close collaboration of theory
and systems researchers, to ensure that the design meets the practical (and
arcane) requirements of a real OS. The empirical evidence presented suggests
that reserving only a small amount of spare capacity (< 5%) is sufficient to
avoid the well-known problems resulting from associativity conflicts.
Current work targets a Linux prototype, representative of a modern system,
that observes no early evictions even in the steady state of a system. Future
work will investigate options for common requirements, such as shared mappings,
and we will more thoroughly evaluate interactions between page lookups and the
TLB.
References
[1] Timothy Merrifield and H. Reza Taheri. “Performance Implications of
Extended Page Tables on Virtualized x86 Processors”. In: Proceedings of
the 12th ACM SIGPLAN/SIGOPS International Conference on Virtual
Execution Environments. VEE ’16. Atlanta, Georgia, USA: ACM, 2016,
pp. 25–35. isbn: 978-1-4503-3947-6. doi: 10.1145/2892242.2892258.
url: http://doi.acm.org.libproxy.lib.unc.edu/10.1145/2892242.2892258.
[2] Skylake (client) - Microarchitectures - Intel. https://en.wikichip.org/
wiki/intel/microarchitectures/skylake_(client). 2016.
[3] Michael M. Swift. “Towards O(1) Memory”. In: Proceedings of the 16th
Workshop on Hot Topics in Operating Systems (HotOS). Whistler, BC,
Canada, 2017, pp. 7–11. doi: 10.1145/3102980.3102982.
url: http://doi.acm.org/10.1145/3102980.3102982.
[4] Mel Gorman. Linux Huge Pages. https://lwn.net/Articles/375096/.
2010.
[5] Mel Gorman. AMD Zen Architecture. https://en.wikichip.org/wiki/
amd/microarchitectures/zen. 2018.
[6] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach,
and Emmett Witchel. “Coordinated and Efficient Huge Page Management
with Ingens”. In: 12th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 16). Savannah, GA: USENIX Association,
Nov. 2016, pp. 705–721. isbn: 978-1-931971-33-1.
url: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/kwon.
[7] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and
Michael M. Swift. “Efficient Virtual Memory for Big Memory Servers”.
In: Proceedings of the 40th Annual International Symposium on Computer
Architecture (ISCA). Tel-Aviv, Israel: ACM, 2013.
[8] K. Pagiamtzis and A. Sheikholeslami. “Content-addressable memory
(CAM) circuits and architectures: a tutorial and survey”. In: IEEE Jour-
nal of Solid-State Circuits 41.3 (2006), pp. 712–727. doi: 10.1109/JSSC.
2005.864128.
[9] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Ne-
mirovsky, M. M. Swift, and O. S. Unsal. “Energy-efficient address trans-
lation”. In: 2016 IEEE International Symposium on High Performance
Computer Architecture (HPCA). 2016, pp. 631–643. doi: 10.1109/HPCA.
2016.7446100.
[10] Andrea Arcangeli. “Transparent hugepage support”. In: KVM forum.
Vol. 9. 2010.
[11] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. “Practical,
Transparent Operating System Support for Superpages”. In: SIGOPS
Oper. Syst. Rev. 36.SI (Dec. 2003), pp. 89–104. issn: 0163-5980.
doi: 10.1145/844128.844138. url: https://doi.org/10.1145/844128.844138.
[12] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and
Michael M. Swift. “Efficient virtual memory for big memory servers”. In:
Proceedings of the 40th Annual International Symposium on Computer Ar-
chitecture - ISCA 13. ACM Press, 2013. doi: 10.1145/2485922.2485943.
url: https://doi.org/10.1145%2F2485922.2485943.
[13] Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh. “Hybrid
TLB Coalescing: Improving TLB Translation Coverage under Diverse
Fragmented Memory Allocations”. In: Proceedings of the 44th Annual
International Symposium on Computer Architecture. ISCA ’17. Toronto,
ON, Canada: Association for Computing Machinery, 2017, pp. 444–456.
isbn: 9781450348928. doi: 10.1145/3079856.3080217.
url: https://doi.org/10.1145/3079856.3080217.
[14] Jian Huang, Moinuddin K. Qureshi, and Karsten Schwan. “An Evolu-
tionary Study of Linux Memory Management for Fun and Profit”. In:
2016 USENIX Annual Technical Conference (USENIX ATC 16). Denver,
CO: USENIX Association, June 2016, pp. 465–478. isbn: 978-1-931971-
30-0. url: https://www.usenix.org/conference/atc16/technical-
sessions/presentation/huang.
[15] Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, and Josep Torrellas.
“Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for
Parallelism”. In: Proceedings of the Twenty-Fifth International Confer-
ence on Architectural Support for Programming Languages and Operating
Systems. ASPLOS ’20. New York, NY, USA: Association for Computing
Machinery, 2020, pp. 1093–1108. doi: 10.1145/3373376.3378493. url:
https://doi.org/10.1145/3373376.3378493.
[16] Martin Raab and Angelika Steger. ““Balls into bins”—A simple and tight
analysis”. In: International Workshop on Randomization and Approxima-
tion Techniques in Computer Science. Springer. 1998, pp. 159–170.
[17] Petra Berenbrink, Artur Czumaj, Angelika Steger, and Berthold Vöcking.
“Balanced Allocations: The Heavily Loaded Case”. In: Proceedings of the
Thirty-Second Annual ACM Symposium on Theory of Computing. STOC
’00. Portland, Oregon, USA: Association for Computing Machinery, 2000,
pp. 745–754. isbn: 1581131844. doi: 10.1145/335305.335411.
url: https://doi.org/10.1145/335305.335411.
[18] Berthold Vöcking. “How asymmetry helps load balancing”. In: Journal of
the ACM (JACM) 50.4 (2003), pp. 568–589.
[19] Jerry Huck and Jim Hays. “Architectural Support for Translation Table
Management in Large Address Space Machines”. In: Proceedings of the
20th Annual International Symposium on Computer Architecture. ISCA
’93. San Diego, California, USA: ACM, 1993, pp. 39–50. isbn: 0-8186-
3810-9. doi: 10.1145/165123.165128. url: http://doi.acm.org/10.
1145/165123.165128.
[20] Idan Yaniv and Dan Tsafrir. “Hash, Don’t Cache (the Page Table)”. In:
Proceedings of the 2016 ACM SIGMETRICS International Conference on
Measurement and Modeling of Computer Science. 2016, pp. 337–350. doi:
10.1145/2896377.2901456. url: https://doi.org/10.1145/2896377.
2901456.
[21] Thomas W. Barr, Alan L. Cox, and Scott Rixner. “Translation Caching:
Skip, Don’t Walk (the Page Table)”. In: Proceedings of the 37th Annual
International Symposium on Computer Architecture. ISCA ’10. Saint-Malo,
France: ACM, 2010, pp. 48–59. isbn: 978-1-4503-0053-7.
doi: 10.1145/1815961.1815970.
url: http://doi.acm.org/10.1145/1815961.1815970.
