Cache-Oblivious VAT-Algorithms by Jurkiewicz, Tomasz et al.
Cache-Oblivious VAT-Algorithms
Tomasz Jurkiewicz∗ Kurt Mehlhorn† Patrick Nicholson‡
March 23, 2019
Abstract
The VAT-model (virtual address translation model) extends the EM-model (external memory model)
and takes the cost of address translation in virtual memories into account. In this model, the cost of a
single memory access may be logarithmic in the largest address used. We show that the VAT-cost of
cache-oblivious algorithms is only a constant factor larger than their EM-cost; this requires a somewhat
more stringent tall cache assumption than for the EM-model.
1 Introduction
Modern processors have a memory hierarchy and use virtual memory. We concentrate on two levels of
the hierarchy and refer to the faster memory as the cache. Data is moved between the fast and the slow
memory in blocks of contiguous memory cells, and only data residing in the fast memory can be accessed
directly. Whenever data in the slow memory is accessed, a cache fault occurs and the block containing the
data must be moved to the fast memory. In the EM-model of computation, the complexity of an algorithm
is defined as the number of cache faults.
In general, many processes are running concurrently on the same machine. Each running process has
its own linear address space 0, 1, 2, . . . , and the operating system ensures that the distinct linear address
spaces of the distinct processes are mapped injectively to the physical address space of the processor. To
this effect, the operating system maintains a translation tree for each process. The translation from virtual
addresses to physical addresses is implemented as a tree walk and incurs cost; see Section 3 for details.
The depth of the tree for a particular process is d = log_K(m/P), where m is the maximum address used by
the process, K is the arity of the tree, and P is the page size. Typically, K ≈ P ≈ 2^10, and K is chosen such
that the space requirement of a translation tree node is equal to the size of a page. The translation process
accesses d pages and only pages residing in fast memory can be accessed directly. Any node visited during
the translation process must be brought into fast memory if not already there, and hence a single memory
access may cause up to d cache faults.
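As a rough illustration of the depth formula, the following Python sketch computes the number of translation levels for a given address range. The function name `translation_depth` and the decision to round up to a whole number of levels are assumptions made for this example, and the parameter values in the usage line are illustrative only.

```python
def translation_depth(max_address: int, page_size: int, arity: int) -> int:
    """Number of tree levels d needed so that arity**d pages cover the
    addresses 0..max_address (an integer version of log_K(m/P))."""
    pages = max_address // page_size + 1  # pages touched by addresses 0..max_address
    depth, reachable = 0, 1
    while reachable < pages:  # each extra level multiplies the reachable pages by K
        reachable *= arity
        depth += 1
    return depth


# Illustrative parameters: K = P = 2**10 and an address space of 2**30 cells.
# In the worst case a single access can cost as many cache faults as this depth.
print(translation_depth(2**30 - 1, 2**10, 2**10))
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of K are not rounded up by accident.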
The cost of the translation process is clearly noticeable in some cases. Jurkiewicz and Mehlhorn [JM13]
timed some simple programs and observed that for some of them the quotient
(measured running time for input size n) / (RAM running time for input size n)
seems to grow logarithmically in n; see Figure 1. Jurkiewicz and Mehlhorn introduced the VAT-model as
an extension of the EM-model to account for the cost of address translation; see Section 3 for a definition of
their model. They showed that the growth rates of the measured running times of the programs mentioned
in Figure 1 are correctly predicted by the model.
The EM-model penalizes non-locality and the VAT-model penalizes it even more. Cache-oblivious
algorithms [FLPR12] show good data locality for all memory sizes. Jurkiewicz and Mehlhorn [JM13]
showed that some cache-oblivious algorithms, namely those that do not need a tall cache assumption, also
perform well in the VAT-model. Their paper poses the question of whether a similar statement can be made
for the much larger class of cache-oblivious algorithms that require a tall cache assumption. We answer
their question in the affirmative.
∗Google Zürich, email: tomasz.tojot.jurkiewicz@gmail.com
†MPI for Informatics, email: mehlhorn@mpi-inf.mpg.de
‡MPI for Informatics, email: pnichols@mpi-inf.mpg.de
arXiv:1404.3577v1 [cs.DS] 14 Apr 2014
[Figure 1 plot: running time / RAM complexity (0 to 160) against log(input size) (9 to 31) for permute, random access, binsearch, heapsort, heapify, introsort, and sequential access.]
Figure 1: The abscissa shows the logarithm of the input size. The ordinate shows the measured running
time divided by the RAM-complexity (normalized operation time). RAM complexity of the programs
shown is either cn or cn log n; [JM13] ignores lower order terms. The constants c were chosen such that
the different plots fit nicely into the same figure. The normalized operation times of sequential access,
quicksort (introsort), and heapify are constant; the normalized operation times of permute, random scan,
repeated binary search, and heapsort grow as functions of the problem size. Note that a straight line corresponds
to a linear function in log(input size).
Our main result is as follows. Consider a cache-oblivious algorithm that incurs C(M˜, B˜, n) cache faults
when run on a machine with cache size M˜ and block size B˜, provided that M˜ ≥ g(B˜). Here g : N → N is a
function that captures the “tallness” requirement on the cache [FLPR12]. We consider the execution of the
algorithm on a VAT-machine with cache size M and page size P and show that the number of cache faults
is bounded by 4d C(M/4, dB, n) provided that M ≥ 4g(dB). In this bound, M and B denote the cache size
and the page size measured in items, that is, the cache size in addressable units divided by a and B = P/a,
where a ≥ 1 is the size (in addressable units) of the items handled by the algorithm.
Funnel sort [FLPR12] is an optimal cache-oblivious sorting algorithm. On an EM-machine with cache
size M˜ and block size B˜, it sorts n items with
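$$C(\tilde{M}, \tilde{B}, n) = O\left(\frac{n}{\tilde{B}} \left\lceil \frac{\log (n/\tilde{M})}{\log (\tilde{M}/\tilde{B})} \right\rceil\right)$$
cache faults provided that M˜ ≥ B˜^2; thus g is quadratic for this algorithm (see footnote 1). As a consequence of our main
theorem, we obtain: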
Theorem 1 Funnel sort sorts n items, each of size a ≥ 1 addressable units, on a VAT-machine with cache
size M and page size P, with at most
$$O\left(\frac{4n}{B} \left\lceil \frac{\log (4n/M)}{\log (M/(4dB))} \right\rceil\right)$$
cache faults, where M and B in the bound are measured in items (the cache size in addressable units
divided by a, and B = P/a). This assumes (B log_K(2n/P))^2 ≤ M/4.
Footnote 1: This constraint can be reduced to M˜ ≥ B˜^{1+ε} for any constant ε > 0 [BF02]; however, we do not wish to introduce additional
notation.
Since M/(4dB) ≥ (M/B)^{1/2} for realistic values of M, B, K, and n, this implies that funnel sort is essentially
optimal in the VAT-model as well.
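The condition can be rearranged into a direct requirement on the ratio M/B. The short derivation below is a worked restatement rather than part of the original argument, and the value d = 4 in the last step is an illustrative assumption.

```latex
\[
\frac{M}{4dB} \ge \left(\frac{M}{B}\right)^{1/2}
\iff \left(\frac{M}{B}\right)^{1/2} \ge 4d
\iff \frac{M}{B} \ge 16\,d^{2},
\qquad \text{e.g. } d = 4 \;\Rightarrow\; \frac{M}{B} \ge 256 .
\]
```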
2 The EM-Model
An EM-machine has a fast memory of size M˜, and data is moved between fast and slow memory in blocks
of size B˜. Algorithms for EM-machines may use M˜ and B˜ in the program code; algorithms are not written
for specific values of M˜ and B˜, but work for any values of M˜ and B˜ satisfying certain constraints, e.g., that
the fast memory can hold a certain number of blocks. We capture these constraints by a function g : N → N
and the requirement M˜ ≥ g(B˜).
Cache-oblivious algorithms are algorithms that do not refer to the parameters M˜ and B˜ in the code. Only
the analysis is done in terms of M˜ and B˜. Frequently, the analysis only holds for M˜ and B˜ satisfying certain
constraints; e.g., that the cache is tall and satisfies M˜ ≥ B˜^2. Again, this can be captured by an appropriate
function g.
It is customary in the EM-literature that the size of the fast memory and the size of a block are expressed
in terms of number of items handled by the algorithm. For example, for a sorting algorithm the items are
the objects to be sorted.
3 The VAT-Model [JM13, Section XXX]
VAT-machines are EM-machines that use virtual addresses. We concentrate on the virtual memory of a
single program. Both real (physical) and virtual addresses are strings in {0, ..., K−1}^d {0, ..., P−1}. Any such
string corresponds to a number in the interval [0, K^d P − 1] in a natural way. The {0, ..., K−1}^d part of the
address is called an index, and its length d is an execution parameter fixed prior to the execution. We
assume d = ⌈log_K(last used address/P)⌉. The {0, ..., P−1} part of the address is called the page offset, and P
is the page size. A page contains P addressable units, usually bytes (see footnote 2). The translation process is a tree
walk. We have a K-ary tree T of height d. The nodes of the tree are pairs (ℓ, i) with ℓ ≥ 0 and i ≥ 0. We
refer to ℓ as the layer of the node and to i as the number of the node. The leaves of the tree are on layer
zero, and a node (ℓ, i) on layer ℓ ≥ 1 has K children on layer ℓ−1, namely the nodes (ℓ−1, Ki + a) for
a = 0, ..., K−1. In particular, node (d, 0), the root, has children (d−1, 0), ..., (d−1, K−1). The leaves
of the tree correspond to physical pages of the main memory of a RAM machine. In order to translate a
virtual address x_{d−1} ... x_0 y, we start in the root of T and then follow the path described by x_{d−1} ... x_0. We
refer to this path as the translation path for the address. The path ends in the leaf (0, ∑_{0≤i≤d−1} x_i K^i). Then
the offset y selects the y-th cell in this page.
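The tree walk described above can be sketched in a few lines of Python. The helper names `split_address` and `translation_path` are hypothetical, and the page size of 8 cells in the usage example is an assumption; the values K = 2, d = 3 and the index 100 follow the example of Figure 2.

```python
def split_address(address, depth, arity, page_size):
    """Split an address into the index digits [x_{d-1}, ..., x_0] and the offset y."""
    page, offset = divmod(address, page_size)
    digits = []
    for _ in range(depth):
        page, digit = divmod(page, arity)
        digits.append(digit)  # collected from x_0 upwards
    if page != 0:
        raise ValueError("address does not fit into a tree of this depth")
    digits.reverse()  # most significant digit x_{d-1} first
    return digits, offset


def translation_path(digits, arity):
    """Nodes (layer, number) visited from the root (d, 0) down to the leaf."""
    depth = len(digits)
    path = [(depth, 0)]
    number = 0
    for layer, digit in zip(range(depth - 1, -1, -1), digits):
        number = arity * number + digit  # child (layer, K * i + digit) of node i
        path.append((layer, number))
    return path


# Index 100 (binary) with K = 2, d = 3; offset 5 in a page of 8 cells (assumed size).
digits, offset = split_address(4 * 8 + 5, depth=3, arity=2, page_size=8)
print(digits, offset, translation_path(digits, arity=2))
```

The last entry of the returned path is the leaf (0, ∑ x_i K^i), that is, the data page holding the addressed cell.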
A VAT-machine has a fast memory (cache) of size M, and data is moved between fast and slow memory
in units of P cells. We assume that a node of the translation tree fits into a page. Only nodes and data
pages in the cache can be accessed directly; nodes and data pages currently in slow memory have to be
brought into fast memory before they can be accessed. More precisely, let a be a virtual address, and let
v_d, v_{d−1}, ..., v_0 be its translation path. Here, v_d is the root of the translation tree, v_d to v_1 are internal nodes
of the translation tree, and v_0 is a data page. We assume that the root node is always in cache. Translating
a requires accessing all nodes of the translation path in order. Only nodes in the cache can be accessed.
If a node needs to be accessed but is not in the cache, it needs to be added to the cache, and some other
page has to be evicted. The translation of a ends when v_0 is accessed. The cost of the memory access is the
number of cache faults incurred during the translation process.
Footnote 2: In actual systems K is chosen such that a node fits exactly into a page. For example, for the 64-bit addressing mode of the
processors of the AMD64 family (see http://en.wikipedia.org/wiki/X86-64), the addressable units are bytes and P = 2^12. Since an
address consists of 2^3 bytes, K = 2^9.
[Figure 2 diagram; labels: translation tree, data pages, y]
Figure 2: The pages holding the data are shown at the bottom and the translation tree is shown above the
data pages. The translation tree has fan-out K and depth d; here K = 2 and d = 3. The translation path for
the virtual index 100 is shown. The offset y selects a cell in the physical page with virtual index 100. The
nodes of the translation tree and the data pages are stored in memory. Only nodes and data pages in fast
memory (cache memory) can be accessed directly; nodes and data pages currently in slow memory have to
be brought into fast memory before they can be accessed. Each such move is a cache fault. In the EM-model
only cache faults for data pages are counted; in the VAT-model, we count cache faults for nodes of the
translation tree and for data pages.
EM- and VAT-machines move cells in contiguous blocks. If the items handled by an EM-machine
comprise a ≥ 1 addressable units, the cache size and the page size measured in items are M/a and
B = P/a, respectively. In the EM cost model only cache faults for data pages are charged, and in the
VAT cost model all cache faults arising in a memory access are charged.
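To make the cost model concrete, here is a minimal Python sketch that counts cache faults for a sequence of accesses. LRU replacement, a cache capacity counted in pages, and a root node kept outside that capacity are assumptions of this sketch, and the function name `vat_faults` is hypothetical.

```python
from collections import OrderedDict


def vat_faults(addresses, depth, arity, page_size, cache_pages):
    """Count cache faults for a sequence of virtual addresses.

    Assumptions of this sketch: LRU replacement, a cache holding
    `cache_pages` pages or tree nodes, and a root node (layer `depth`)
    that is always resident and not counted against that capacity.
    """
    cache = OrderedDict()  # keys are (layer, number); least recently used first
    faults = 0
    for address in addresses:
        page = address // page_size
        # Walk the translation path below the root, ending at the data page (layer 0).
        for layer in range(depth - 1, -1, -1):
            node = (layer, page // arity**layer)
            if node in cache:
                cache.move_to_end(node)  # mark as most recently used
            else:
                faults += 1
                cache[node] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)  # evict the least recently used entry
    return faults


# Toy machine: K = 2, d = 3, pages of 4 cells, so 8 pages and 32 addresses.
sequential = vat_faults(range(32), depth=3, arity=2, page_size=4, cache_pages=14)
alternating = vat_faults([0, 28] * 5, depth=3, arity=2, page_size=4, cache_pages=3)
print(sequential, alternating)
```

In this toy setting the sequential scan faults once per node or page, whereas the alternating pattern with a three-entry cache pays d = 3 faults on every single access.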
4 EM-Algorithms as VAT-Algorithms
In the worst case, a memory access causes one cache fault in the EM-model and d cache faults in the VAT-
model. In order to amortize the cost of a cache fault in the EM-model, it suffices to access a significant
fraction of the memory cells in each page brought to fast memory. Therefore, an EM-algorithm that is
efficient for block size dP/a should also be efficient in the VAT-model. The following discussion captures
this intuition.
For an EM-algorithm, let C(M˜, B˜, n) be the number of IO-operations on an input of size n, where M˜ is
the size of the faster memory (also called cache memory), B˜ is the block size, and M˜ ≥ g(B˜). The following
theorem shows that any EM-algorithm that is aware of M˜ and B˜ implies the existence of a VAT-algorithm
that is aware of M and P.
Theorem 2 Let g : N → N. Consider an EM-algorithm with IO-complexity C(M˜, B˜, n), where M˜ is the size
of the cache, B˜ is the size of a block, and n is the input size, provided that M˜ ≥ g(B˜). Let d = log_K(n/P) and
assume g(dP/a) ≤ M/(4a), where a ≥ 1 is the item size in number of addressable units. The program can
be made to run on a VAT-machine with cache size M and page size P with at most 4d C(M/(4a), dP/a, n)
cache faults. If the cache size and the page size are measured in items (so that M stands for the cache size
in addressable units divided by a, and B = P/a), the upper bound can be stated as 4d C(M/4, dB, n)
and the tallness requirement becomes M ≥ 4g(dB).
Proof: We run the EM-algorithm with a cache size of M˜ = M/(4a) and a block size of B˜ = dP/a and show
how to execute it efficiently on a VAT-machine with page size P and cache size M. We use one-fourth of
the cache for data and three-fourths for nodes of the translation tree. Since
M˜ = M/(4a) ≥ g(dP/a) = g(B˜),
the number of cache faults incurred by the EM-algorithm is at most C(M˜, B˜, n).
Whenever the EM-algorithm moves a block (containing B˜ items) to its data cache, the VAT-machine
moves the corresponding d pages to the data cache and also moves all internal nodes of the translation paths
to these d pages to the translation cache. The number of internal nodes is bounded by 2d + ∑_{i≥1} d/K^i ≤
2d + d/(K−1) ≤ 3d. Thus a translation cache of size 3M/4 suffices to store the translation paths to all
pages in the data cache. For every cache fault of the EM-algorithm, the VAT-machine incurs at most 4d cache
faults: d for the data pages and at most 3d for the translation nodes.
The theorem follows.
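The counting step in the proof can be checked by brute force on small trees. The following Python sketch counts the distinct internal nodes on the translation paths to d consecutive pages; the function name `internal_nodes` and the choice of small test parameters are assumptions of this example.

```python
def internal_nodes(first_page, depth, arity):
    """Distinct internal nodes (layers 1..depth) on the translation paths
    to the `depth` consecutive pages starting at `first_page`."""
    nodes = set()
    for page in range(first_page, first_page + depth):
        for layer in range(1, depth + 1):
            nodes.add((layer, page // arity**layer))  # ancestor of the page on this layer
    return len(nodes)


# Exhaustive check of the 3d bound for a small binary tree (K = 2, d = 6).
K, d = 2, 6
worst = max(internal_nodes(s, d, K) for s in range(K**d - d + 1))
print(worst, 3 * d)
```

The check does not require the d pages to be aligned to any boundary, which is the reason for the additive term 2d in the bound.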
We apply the theorem to multi-way mergesort, an optimal sorting algorithm in the EM-model. It first
creates n/M˜ sorted runs of size M˜ each by sorting chunks of size M˜ in internal memory. It then performs
multi-way merge sort on these runs. For the merge, it keeps one block from each input run, one block of
the output run, and a heap containing the first elements of each run in fast memory. If M˜ ≥ 5B˜, we can
merge two sequences since the space for the heap is certainly no more than the space for the input runs.
The scheme results in a merge factor of Θ(M˜/B˜). We assume for simplicity that the factor is exactly M˜/B˜.
Thus
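$$C(\tilde{M}, \tilde{B}, n) = \frac{n}{\tilde{B}} \left(1 + \left\lceil \frac{\log (n/\tilde{M})}{\log (\tilde{M}/\tilde{B})} \right\rceil\right).$$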
By Theorem 2 and with a cache size of M and page size P, the number of cache faults in the VAT-model
is at most
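$$4d\,C(M/(4a), dP/a, n) = \frac{4n}{P/a} \left(1 + \left\lceil \frac{\log (4an/M)}{\log (M/(4dP))} \right\rceil\right) = \frac{4n}{B} \left(1 + \left\lceil \frac{\log (4n/M)}{\log (M/(4dB))} \right\rceil\right)$$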
Here a is the item size in number of addressable units, and in the last expression M and B are measured
in items (the cache size in addressable units divided by a, and B = P/a). This assumes 5dP ≤ M/4.
If M/(4dB) ≥ (M/B)^{1/2}, the asymptotic number of cache faults is the same in both models. For realistic
values of M, B, and n, this will be the case.
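The two bounds can be compared numerically. The Python sketch below evaluates the EM bound and the VAT bound of Theorem 2 for multi-way mergesort; the function names and the machine parameters (8 GiB of fast memory, 4 KiB pages, 8-byte items, depth d = 2) are hypothetical, and the depth is passed in by the caller rather than derived from n.

```python
import math


def em_mergesort_faults(n, cache_items, block_items):
    """EM bound (n/B~) * (1 + ceil(log(n/M~) / log(M~/B~))) from the text."""
    passes = math.ceil(math.log(n / cache_items) / math.log(cache_items / block_items))
    return (n / block_items) * (1 + max(0, passes))  # no merge pass if the input fits


def vat_mergesort_bound(n, cache_units, page_units, item_units, depth):
    """VAT bound 4d * C(M/(4a), dP/a, n) from Theorem 2."""
    em_cache = cache_units / (4 * item_units)   # M~ = M/(4a)
    em_block = depth * page_units / item_units  # B~ = dP/a
    return 4 * depth * em_mergesort_faults(n, em_cache, em_block)


# Hypothetical machine: M = 2**33 bytes, P = 2**12 bytes, a = 8 bytes, d = 2.
n, M, P, a, d = 2**30, 2**33, 2**12, 8, 2
vat = vat_mergesort_bound(n, M, P, a, d)
em = em_mergesort_faults(n, M / a, P / a)
print(vat, em, vat / em)
```

For these parameters the two bounds differ by a constant factor, in line with the statement that the asymptotic number of cache faults is the same in both models.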
5 Cache-Oblivious VAT-Algorithms
We next extend the theorem to cache-oblivious algorithms. Unlike the previous theorem, the following
theorem indicates that any algorithm, regardless of whether it is aware of the cache and block sizes, implies
the existence of a VAT-algorithm that is similarly oblivious to M and P.
Theorem 3 Let g : N → N. Let A be an algorithm that incurs C(M˜, B˜, n) cache faults in the EM-model
with cache size M˜ and block size B˜ on an input of size n, provided that M˜ ≥ g(B˜). Let d = log_K(n/P)
and assume g(dP/a) ≤ M/(4a), where a ≥ 1 is the item size in number of addressable units. On a VAT-machine
with cache size M and page size P and optimal use of the cache, the program incurs at most
4d C(M/(4a), dP/a, n) cache faults. If the cache size and the page size are measured in items (so that M
stands for the cache size in addressable units divided by a, and B = P/a), the upper bound can be stated as
4d C(M/4, dB, n) and the tallness requirement becomes M/4 ≥ g(dB).
Proof: We can almost literally reuse the proof of Theorem 2. The optimal execution of A on a VAT-machine
with cache size M and page size P cannot incur more cache faults than the particular execution
that we describe next.
Consider an execution of A on an EM-machine with cache size M˜ = M/(4a) and block size B˜ = dP/a.
Since
M˜ = M/(4a) ≥ g(dP/a) = g(B˜),
the number of cache faults incurred by the EM-algorithm is at most C(M˜, B˜, n). We execute the program
on the VAT-machine as in the proof of Theorem 2.
We use one-fourth of the cache for data and three-fourths for nodes of the translation tree. Whenever the
EM-machine moves a block (of size B˜ items) to its data cache, the VAT-machine moves the corresponding
d pages to the data cache and also moves all internal nodes of the translation paths to these d pages to the
translation cache. The number of internal nodes is bounded by 2d + ∑_{i≥1} d/K^i ≤ 2d + d/(K−1) ≤ 3d.
Thus a translation cache of size 3M/4 suffices to store the translation paths to all pages in the data cache.
For every cache fault of the EM-machine, the VAT-machine incurs at most 4d cache faults. The theorem follows.
The optimal cache replacement strategy may be replaced by LRU at the cost of doubling the cache size
and doubling the number of cache faults [JM13].
Recall that a cache-oblivious algorithm has no knowledge of the memory and the block size. It does
well for any choice of M˜ and B˜ as long as M˜ ≥ g(B˜). The theorem tells us that it also does well in the
VAT-model as long as the more stringent requirement M ≥ 4a·g(dB) is satisfied.
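How much more stringent the requirement is can be seen from a small calculation. The Python sketch below compares the minimum cache size under the EM tall cache assumption with the one required in the VAT-model for a quadratic g; the function names and the parameters (4 KiB pages, 8-byte items, depth 4) are assumptions chosen for illustration.

```python
def min_cache_em(page_units, item_units, g):
    """Smallest cache size (in addressable units) with M/a >= g(P/a)."""
    return item_units * g(page_units // item_units)


def min_cache_vat(page_units, item_units, depth, g):
    """Smallest cache size (in addressable units) with M >= 4a * g(dB), B = P/a."""
    block_items = page_units // item_units
    return 4 * item_units * g(depth * block_items)


def quadratic(x):
    """Tallness function g(x) = x^2, as stated for funnel sort."""
    return x * x


# Hypothetical parameters: P = 2**12 bytes, a = 8 bytes, d = 4.
em_min = min_cache_em(2**12, 8, quadratic)
vat_min = min_cache_vat(2**12, 8, 4, quadratic)
print(em_min, vat_min, vat_min // em_min)
```

For a quadratic g the ratio between the two requirements is 4d^2, which grows only with the depth of the translation tree.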
6 Conclusion
We have shown that the performance of cache-oblivious algorithms in the VAT-model matches their
performance in the EM-model up to a constant factor, provided a somewhat more stringent tall cache assumption holds.
References
[BF02] Gerth Stølting Brodal and Rolf Fagerberg. Cache oblivious distribution sweeping. In ICALP,
volume 2380 of LNCS, pages 426–438, 2002.
[FLPR12] M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. ACM
Transactions on Algorithms, pages 4:1–4:22, 2012. A preliminary version appeared in FOCS
1999.
[JM13] Tomasz Jurkiewicz and Kurt Mehlhorn. The Cost of Address Translation. In ALENEX, pages
148–162, 2013. Full version available at http://arxiv.org/abs/1212.0703.
