Modern computers are not Random Access Machines (RAMs). They have a memory hierarchy, multiple cores, and a virtual memory. We address the computational cost of the address translation in the virtual memory. The starting point for our work on virtual memory is the observation that the analysis of some simple algorithms (random scan of an array, binary search, heapsort) in either the RAM model or the External Memory (EM) model does not correctly predict growth rates of actual running times. We propose the Virtual Address Translation (VAT) model to account for the cost of address translations and analyze the algorithms mentioned and others in the model. The predictions agree with the measurements. We also analyze the VAT-cost of cache-oblivious algorithms.
INTRODUCTION
The role of models of computation in algorithmics is to provide abstractions of real machines for algorithm analysis. Models should be mathematically pleasing and have a predictive value. Both aspects are essential. If the analysis has no predictive value, it is merely a mathematical exercise. If a model is not clean and simple, researchers will not use it. The standard models for algorithm analysis are the Random Access Machine (RAM) model [Shepherdson and Sturgis 1963] and the External Memory (EM) model [Aggarwal and Vitter 1988].
The RAM model is by far the most popular. It is an abstraction of the von Neumann architecture. A computer consists of a control and processing unit and an unbounded memory. Each memory cell can hold a word, and memory access as well as logical and arithmetic operations on words take constant time. The word length is either an explicit parameter or assumed to be logarithmic in the size of the input. The model is very simple and has a predictive value.
A preliminary version of this article appeared in ALENEX 2013. The article is based on the first author's PhD thesis. Authors' addresses: T. Jurkiewicz, Google, Zürich, Switzerland; K. Mehlhorn, MPI Informatik, Campus E1.3, 66123 Saarbrücken, Germany.
Modern machines have virtual memory, multiple processor cores, and an extensive memory hierarchy involving several levels of cache memory, main memory, and disks. The EM model was introduced because the RAM model does not account for the memory hierarchy, and hence, the RAM model has no predictive value for computations involving disks. We give more details on both models in Section 2.
This research started with a simple experiment. We timed six simple programs for different input sizes, namely, permuting the elements of an array of size n, random scan of an array of size n, n random binary searches in an array of size n, heapsort of n elements, introsort of n elements, and sequential scan of an array of size n. For some of the programs (e.g., sequential scan through an array and quicksort), the measured running times agree very well with the predictions of the models. However, the running time of random scan seems to grow as $O(n \log n)$, and the running time of the binary searches seems to grow as $O(n \log^2 n)$, a blatant violation of what either model predicts. We give the details of the experiments in Section 3.
Why do measured and predicted running times differ? Modern computers have virtual memories. Each process has its own virtual address space {0, 1, 2, . . .}. Whenever a process accesses memory, the virtual address has to be translated into a physical address. The translation of virtual addresses into physical addresses incurs cost. The translation process is usually implemented as a hardware-supported walk in a prefix tree (see Section 4 for details). The tree is stored in the memory hierarchy, and hence, the translation process may incur cache faults. The number of cache faults depends on the locality of memory accesses: the less local, the more cache faults. The depth of the translation tree is logarithmic in the size of an algorithm's address space and, hence, in the worst case, every memory access may lead to a logarithmic number of cache faults during the translation process. For random scan and random binary searches, it apparently does.
We propose an extension of the EM model, the Virtual Address Translation (VAT) model, that accounts for the cost of address translation in Section 5. We show that we may assume that the translation process makes optimal use of the cache memory by relating the cost of optimal use with the cost under the Least Recently Used (LRU) strategy (see Section 5). We analyze a number of programs, including the six mentioned, in the VAT model and obtain good agreement with the measured running times in Section 6. We relate the cost of a cache-oblivious algorithm in the EM model to the cost in the VAT model in Section 7. In particular, cache-oblivious algorithms that do not need a tall-cache assumption incur no or little overhead. In Section 8, we address comments made by reviewers and readers of the article. We close with some suggestions for further research and consequences for teaching in Section 9.
Related Work. It is well-known in the architecture and systems community that virtual memory and address translation come at a cost. Many textbooks on computer organization, such as Hennessy and Patterson [2007], discuss virtual memories. The papers by Drepper [2007, 2008] describe computer memories, including virtual translation, in great detail. Advanced Micro Devices [2010] provides further implementation details.
The cost of address translation has received little attention from the algorithms community. The survey paper by Rahman [2003] on algorithms for hardware caches and the Translation Lookaside Buffer (TLB) summarizes the work on the subject. She discusses a number of theoretical models for memory. All models discussed in Rahman [2003] treat address translation atomically; that is, the translation from virtual to physical addresses is a single operation. However, this is no longer true. In 64-bit systems, the translation process is a tree walk. Our article is the first that proposes a theoretical model for address translation and analyzes algorithms in this model.
THE RANDOM ACCESS MACHINE AND THE EXTERNAL MEMORY MACHINE
A RAM machine consists of a central processing unit and a memory. The memory consists of cells indexed by nonnegative integers. A cell can hold a bitstring. The CPU has a finite number of registers, in particular an accumulator and an address register. In any one step, a RAM can either perform an operation (simple arithmetic or boolean operations) on its registers or access memory. In a memory access, the content of the memory cell indexed by the content of the address register is either loaded into the accumulator or written from the accumulator. Two timing models are used: in the unit-cost RAM, each operation has cost one, and the length of the bitstrings that can be stored in memory cells and registers is bounded by the logarithm of the size of the input; in the logarithmic-cost RAM, the cost of an operation is equal to the sum of the lengths (in bits) of the operands, and the contents of memory cells and registers are unrestricted.
An EM machine is a RAM with two levels of memory. The levels are referred to as cache and main memory or memory and disk, respectively. We use the terms cache and main memory. The CPU can only operate on data in the cache. Cache and main memory are each divided into blocks of B cells, and data are transported between cache and main memory in blocks. The cache has size M and hence consists of M/B blocks; the main memory is infinite in size. The analysis of algorithms in the EM model bounds the number of CPU steps and the number of block transfers. The time required for a block transfer is equal to the time required by $\Theta(B)$ CPU steps. The hidden constant factor is fairly large, and, therefore, the emphasis of the analysis is usually on the number of block transfers.
SOME PUZZLING EXPERIMENTS
We used the following seven programs in our experiments. Let A be an array of size n. On a RAM, the first two, the last, and heapify are linear time $\Theta(n)$, and the others are $\Theta(n \log n)$. Figure 1 shows the measured running times for these programs divided by their RAM complexity; we refer to this quantity as normalized operation time. More details about our experimental methodology are available in Section 3.2. If RAM complexity is a good predictor, the normalized operation times should be approximately constant. We observe that two of the linear time programs show linear behavior (namely, sequential access and heapify), that one of the $\Theta(n \log n)$ programs shows $\Theta(n \log n)$ behavior (namely, quicksort), and that for the other programs (heapsort, repeated binary search, permute, random access), the actual running time grows faster than what the RAM model predicts. How much faster and why? Figure 1 also answers the "how much faster" part of the question. Normalized operation time seems to be piecewise linear in the logarithm of the problem size; observe that we are using a logarithmic scale for the abscissa in this figure. For heapsort and repeated binary search, normalized operation time is almost perfectly piecewise linear; for permute and random scan, the piecewise linearity should be taken with a grain of salt. The pieces correspond to the memory hierarchy. The measurements suggest that the running times of permute and random scan grow like $\Theta(n \log n)$, and the running times of heapsort and repeated binary search grow like $\Theta(n \log^2 n)$.
Memory Hierarchy Does Not Explain It
We argue in this section that the memory hierarchy does not explain the experimental findings. We give a detailed analysis of the cost of a random scan of an array of size n in a hierarchical memory and relate it to the measured running time. We see that the prediction by the model and the measured running times differ widely. A simpler argument for a one-level memory hierarchy is given in Section 5.1. Let $s_i$, $i \ge 0$, be the size of the i-th level $C_i$ of the memory hierarchy; $s_{-1} = 0$. We assume $C_i \subset C_{i+1}$ for all i. Let $\ell$ be such that $s_\ell < n \le s_{\ell+1}$ (i.e., the array fits into level $\ell + 1$ but does not fit into level $\ell$). For $i \le \ell$, a random address is in $C_i$ but not in $C_{i-1}$ with probability $(s_i - s_{i-1})/n$; it is in $C_{\ell+1}$ but not in $C_\ell$ with the remaining probability $(n - s_\ell)/n$. Let $c_i$ be the cost of accessing an address that is in $C_i$ but not in $C_{i-1}$. The expected total cost in the external memory model is equal to
$$n \cdot \left( \frac{n - s_\ell}{n}\, c_{\ell+1} + \sum_{0 \le i \le \ell} \frac{s_i - s_{i-1}}{n}\, c_i \right) = n\, c_{\ell+1} - \sum_{0 \le i \le \ell} s_i\, (c_{i+1} - c_i).$$
This is a piecewise linear function whose slope is $c_{\ell+1}$ for $s_\ell < n \le s_{\ell+1}$. The slopes are increasing but change only when a new level of the memory hierarchy is used. Figure 2 shows the measured running time of random scan divided by EM complexity as a function of the logarithm of the problem size. Clearly, the figure does not show the graph of a constant function.
Fig. 2. The running time of random scan divided by the EM complexity. We used the following parameters for the memory hierarchy (the sizes are taken from the machine specification, and the access times were determined experimentally):
Programs used for the preparation of Figure 1 were compiled by gcc in version Debian 4.4.5-8 and run on Debian Linux in version 6.0.3, on a machine with an Intel Xeon X5690 processor (3.46GHz, 12MiB Smart Cache, 6.4GT/s QPI). The caption of Figure 2 lists further machine parameters. In each case, we performed multiple repetitions and took the minimum measurement for each considered size of the input data. We chose the minimum because we are estimating the cost that must be incurred. We also experimented with average or median; moreover, we performed the experiments on other machines and operating systems and obtained consistent results in each case. We grew input sizes by factors of 1.4 to exclude the influence of memory associativity, and we made sure that the largest problem size still fitted in the main memory to eliminate swapping.
For each experiment, we computed its normalized operation time, which we define as the measured execution time divided by the RAM complexity. In this way, we eliminate the known factors. The resulting function represents the cost of a single RAM operation in relation to the problem size.
We use semi-log plots to show normalized operation cost as a function of the logarithm of the input size. In such a plot, linear functions of the logarithm of the input size are easily identified as straight lines.
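The measurement procedure described above (geometrically growing input sizes, the minimum over several repetitions, and division by the RAM complexity) can be sketched as follows. This is a minimal Python illustration and not the authors' benchmark code, which was compiled with gcc; the function names `input_sizes`, `min_time`, `normalized_time`, and `random_scan` are hypothetical, and Python's interpreter overhead will mask much of the effect that the experiments measure.

```python
import random
import time


def input_sizes(start, limit, factor=1.4):
    """Input sizes growing geometrically by `factor` (1.4 in the text)."""
    sizes = []
    n = float(start)
    while n <= limit:
        sizes.append(int(round(n)))
        n *= factor
    return sizes


def min_time(run, repetitions=5):
    """Minimum wall-clock time over several repetitions of run()."""
    best = float("inf")
    for _ in range(repetitions):
        t0 = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - t0)
    return best


def normalized_time(seconds, ram_complexity):
    """Measured time divided by the RAM-model operation count."""
    return seconds / ram_complexity


def random_scan(a, order):
    """Sum the array in the order given by a precomputed permutation."""
    s = 0
    for i in order:
        s += a[i]
    return s


if __name__ == "__main__":
    for n in input_sizes(1 << 10, 1 << 14):
        a = list(range(n))
        order = list(range(n))
        random.shuffle(order)
        t = min_time(lambda: random_scan(a, order))
        # The RAM complexity of a random scan is n.
        print(n, normalized_time(t, n))
```

Plotting the printed values against the logarithm of n gives the kind of semi-log plot described in the text.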
VIRTUAL MEMORY
Virtual addressing was motivated by multiprocessing. When several processes are executed concurrently on the same machine, it is convenient and more secure to give each program a linear address space indexed by the nonnegative integers. However, these addresses are now virtual and no longer directly correspond to physical (real) addresses. Rather, it is the task of the operating system to map the virtual addresses of all processes to a single physical memory. The mapping process is hardware supported.
The memory is viewed as a collection of pages of $P = 2^p$ cells (= addressable units). Both virtual and real addresses consist of an index and an offset. The index selects a page, and the offset selects a cell in a page. The index is broken into d segments of length $k = \log K$. For example, for the long addressing mode of the processors of the AMD64 family (see http://en.wikipedia.org/wiki/X86-64), the numbers are d = 4, k = 9, and p = 12; the remaining 16 bits are used for other purposes. The choice of k is not arbitrary. A page consists of $2^{12}$ bytes. An address consists of 8 bytes; hence, a node of the translation tree requires $2^9 \cdot 2^3 = 2^{12}$ bytes. Thus nodes fit exactly into pages. Logically, the translation process is a walk in a tree with outdegree K; this tree is usually called the page table [Drepper 2008; Hennessy and Patterson 2007]. The walk starts at the root; the first segment of the index determines the child of the root, the second segment of the index determines the child of the child, and so on. The leaves of the tree store indices of physical pages. The offset then determines the cell in the physical address (i.e., offsets are not translated but taken verbatim). Quoting Advanced Micro Devices [2010]:
Figure 3 on page 7 shows an overview of the page-translation hierarchy used in long mode. Legacy mode paging uses a subset of this translation hierarchy. As this figure shows, a virtual address is divided into fields, each of which is used as an offset into a translation table. The complete translation chain is made up of all table entries referenced by the virtual-address fields. The lowest-order virtual-address bits are used as the byte offset into the physical page.
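The split of a virtual address into index segments and an offset can be illustrated with a short sketch. The function name `split_virtual_address` is hypothetical; the default parameters are the AMD64 long-mode numbers given above (d = 4, k = 9, p = 12), and the upper 16 bits are simply masked off here, which is a simplifying assumption.

```python
def split_virtual_address(addr, d=4, k=9, p=12):
    """Split an address into d index segments (k bits each) and a p-bit offset.

    Returns (segments, offset); segments are ordered root-first, that is,
    segments[0] selects the child of the root of the translation tree.
    """
    offset = addr & ((1 << p) - 1)
    index = (addr >> p) & ((1 << (d * k)) - 1)
    segments = [(index >> (k * level)) & ((1 << k) - 1)
                for level in range(d - 1, -1, -1)]
    return segments, offset


if __name__ == "__main__":
    addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
    print(split_virtual_address(addr))  # ([1, 2, 3, 4], 5)
```

The segments are the child numbers followed during the tree walk, and the offset is passed through untranslated.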
Due to its size, the page table is stored in the RAM, but nodes accessed during the page table walk have to be brought to faster memory. A small number of recent translations is stored in the TLB. The TLB is a small associative memory that contains physical indices indexed by the virtual ones. This is akin to the first-level cache for data. Quoting Advanced Micro Devices [2010] further:
Every memory access has its virtual address automatically translated into a physical address using the page-translation hierarchy. Translation-lookaside buffers (TLBs), also known as page-translation caches, nearly eliminate the performance penalty associated with page translation. TLBs are special on-chip caches that hold the most-recently used virtual-to-physical address translations. Each memory reference (instruction and data) is checked by the TLB. If the translation is present in the TLB, it is immediately provided to the processor, thus avoiding external memory references for accessing page tables. TLBs take advantage of the principle of locality. That is, if a memory address is referenced, it is likely that nearby memory addresses will be referenced in the near future.
VAT, THE VIRTUAL ADDRESS TRANSLATION MODEL
VAT machines are RAM machines that use virtual addresses. We concentrate on the virtual memory of a single program. Both real (physical) and virtual addresses are strings in $\{0, \dots, K-1\}^d \{0, \dots, P-1\}$. The $\{0, \dots, K-1\}^d$ part of the address is called the index. It is assumed that $d = \lceil \log_K(\text{maximum used virtual address}/P) \rceil$. The $\{0, \dots, P-1\}$ part of the address is called the page offset, and P is the page size. The translation process is a tree walk. We have a K-ary tree T of height d. The nodes of the tree are pairs $(\ell, i)$ with $\ell \ge 0$ and $i \ge 0$. We refer to $\ell$ as the layer of the node and to i as the number of the node. The leaves of the tree are on layer zero, and a node $(\ell, i)$ on layer $\ell \ge 1$ has K children on layer $\ell - 1$; namely, the nodes $(\ell - 1, Ki + a)$ for $a = 0, \dots, K - 1$. The leaves of the tree are physical pages of the main memory of a RAM machine. In order to translate virtual address $x_{d-1} \dots x_0 y$, we start in the root of T and then follow the path described by $x_{d-1} \dots x_0$. The number of cache faults incurred by the memory access is the number of insertions into the translation cache (TC) performed during the translation process, and the cost of the memory access is $\tau$ times the number of cache faults. The number of cache faults is at least the number of nodes of the translation path that are not present in the cache at the beginning of the translation. Figure 4 summarizes the notation.
We close the introduction of the model with a trivial, but useful observation. 
Relation to EM Model
In the EM model, one counts only cache faults caused by data pages; in the VAT model, one counts cache faults caused by all translation tree nodes.
A comparison of jumping scan and random scan illustrates the difference; this example is provided by Jirka Kataijanen (personal communication). A jumping scan with stride P accesses cells 0, P, 2P, 3P, . . . of an array in order. In the EM model, each access causes one page fault and hence the EM cost of jumping scan and random scan is identical.
Fig. 5. The pages holding the data are shown at the bottom and the translation tree is shown above the data pages. The translation tree has fan-out K and depth d; here, K = 2 and d = 3. The translation path for the virtual index 100 is shown. The offset y selects a cell in the physical page with virtual index 100. The nodes of the translation tree and the data pages are stored in memory. Only nodes and data pages in fast memory (cache memory) can be accessed directly; nodes and data pages currently in slow memory have to be brought into fast memory before they can be accessed. Each such move is a cache fault. In the EM model, only cache faults for data pages are counted; in the VAT model, we count cache faults for all nodes of the translation tree.
In the VAT model, a random scan is much more costly than a jumping scan. By Lemma 6.1, the cost of a random scan of an array of size n is at least $\tau n \log_K(n/(PW))$. However, the cost of a jumping scan is at most $\tau n(1 + 1/(K - 1))$. Observe (see Figure 5) that K subsequent data pages share the last internal node of the translation path, $K^2$ subsequent data pages share the last two internal nodes, and so on; hence, the number of page faults caused by internal nodes of the translation tree is bounded by
$$\frac{n}{K} + \frac{n}{K^2} + \dots \le \frac{n}{K - 1}.$$
A second comparison is also informative. Assume that pages are accessed in random order, and, once a page is accessed, all cells of the page are accessed. With respect to cache faults, n such accesses are equivalent to n/P random accesses. In the EM model, the number of cache faults is n/P, which is the same as for a linear scan; in the VAT model, the number is at least $(n/P) \log_K(n/(P^2 W))$.
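The difference between the two scans can be observed in a small simulation. The following is a sketch under stated assumptions rather than the model's formal definition: nodes are pairs (layer, number) with the children of $(\ell, i)$ numbered $Ki, \dots, Ki + K - 1$, the data pages (leaves) are cached together with the internal nodes, and the translation cache uses LRU. The names `translation_path` and `count_faults` are hypothetical.

```python
import random
from collections import OrderedDict


def translation_path(page, K, d):
    """Nodes (layer, number) from the root (layer d) down to the leaf (layer 0)."""
    return [(layer, page // K ** layer) for layer in range(d, -1, -1)]


def count_faults(pages, K, d, W):
    """Count cache faults for a sequence of page accesses under LRU."""
    tc = OrderedDict()  # nodes ordered from least to most recently used
    faults = 0
    for page in pages:
        for node in translation_path(page, K, d):
            if node in tc:
                tc.move_to_end(node)  # hit: mark as most recently used
            else:
                faults += 1
                if len(tc) >= W:
                    tc.popitem(last=False)  # evict the least recently used
                tc[node] = True
    return faults


if __name__ == "__main__":
    K, d, W = 4, 5, 16
    n = K ** d  # number of data pages
    jumping = list(range(n))
    shuffled = jumping[:]
    random.Random(0).shuffle(shuffled)
    print("jumping scan:", count_faults(jumping, K, d, W))
    print("random scan: ", count_faults(shuffled, K, d, W))
```

With K = 4, d = 5, and W = 16, the jumping scan loads every node of the tree exactly once, that is, 1365 faults for 1024 pages, which is within the bound n(1 + 1/(K - 1)). Any access order must load each node at least once, so the random order can only do worse.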
TC Replacement Strategies
Since the TC is a special case of a cache in a classic EM machine, the following classic result applies.
LEMMA 5.2 ([SLEATOR AND TARJAN 1985; FRIGO ET AL. 2012]). An optimal replacement strategy is at most a factor of 2 better than LRU on a cache of double size, assuming both caches start empty.
This result is useful for upper and lower bounds. LRU is easy to implement. In upper bound arguments, we may use any replacement strategy and then appeal to the Lemma. In lower bound arguments, we may assume the use of LRU. For TC caches, it is natural to assume the initial segment property.
Definition 5.3. An initial segment of a rooted tree is an empty tree or a connected subgraph of the tree containing the root. A TC has the Initial Segment Property (ISP) if the TC contains an initial segment of the translation tree. A TC replacement strategy has ISP if, under this strategy, a TC has ISP at all times.
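Definition 5.3 can be checked mechanically. The sketch below assumes that nodes are pairs (layer, number), that the root is (d, 0), and that the parent of $(\ell, i)$ is $(\ell + 1, \lfloor i/K \rfloor)$; the function names `parent` and `has_isp` are hypothetical. A nonempty set of nodes is a connected subgraph containing the root exactly when it contains the root and the parent of each of its other nodes.

```python
def parent(node, K):
    """Parent of node (layer, number) in the K-ary translation tree."""
    layer, number = node
    return (layer + 1, number // K)


def has_isp(tc, K, d):
    """True if the node set `tc` is an initial segment of the translation tree."""
    root = (d, 0)
    if not tc:
        return True  # the empty tree is an initial segment
    if root not in tc:
        return False
    # Every non-root node must be attached to the segment through its parent.
    return all(node == root or parent(node, K) in tc for node in tc)


if __name__ == "__main__":
    print(has_isp({(3, 0), (2, 1), (1, 2)}, K=2, d=3))  # True
    print(has_isp({(3, 0), (1, 2)}, K=2, d=3))          # False: (2, 1) is missing
```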
PROPOSITION 5.4. Strategies with ISP exist only for TCs with W > d.
ISP is important because, as we show later, ISP can be realized at no additional cost for LRU and at little additional cost for the optimal replacement strategy. Therefore, strategies with ISP can significantly simplify proofs for upper and lower bounds. Moreover, strategies with ISP are easier to implement. Any implementation of a caching system requires some way to search the cache. This requires an indexing mechanism. RAM memory is indexed by the memory translation tree. In the case of the TC itself, ISP allows us to integrate the indexing structure into the cached content. One only has to store the root of the tree at a fixed position or store the location of the root in a special register, called the Page Map Base Register in Figure 3 .
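We establish the following relations in this section:
$$\mathrm{MIN}(W) \le \mathrm{ISMIN}(W) \le \mathrm{ISLRU}(W) \le \mathrm{LRU}(W) \le 2\,\mathrm{MIN}(W/2),$$
$$\mathrm{LRU}(W + d) \le \mathrm{ISLRU}(W), \qquad \mathrm{ISMIN}(W + d) \le \mathrm{MIN}(W);$$
here, W is the size of the translation cache, d is the index length, LRU is the least-recently used replacement strategy, MIN is the optimal cache replacement strategy, ISLRU is LRU with the ISP property, and ISMIN is the optimal replacement strategy with the ISP property.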
Eager Strategies and the Initial Segment Property
Before we prove an ISP analogue of Lemma 5.2, we need to better understand the behavior of replacement strategies with ISP. For classic caches, premature evictions and insertions do not improve efficiency. We will show that the same holds true for TCs with ISP. This is useful because we use early evictions and insertions in some of our arguments.
Definition 5.5. A replacement strategy is lazy if it performs an insertion of a missing node only if the node is accessed directly after this insertion, and it performs an eviction only before an insertion for which there would be no free cell otherwise. In the opposite case, the strategy is eager. Unless stated otherwise, we assume that a strategy being discussed is lazy.
Eager strategies can perform replacements before they are needed and can even insert nodes that are not needed at all. Also, they can insert and re-evict, or evict and reinsert nodes during a single translation. We eliminate this behavior translation by translation as follows. Consider a fixed translation and define the sets of effective evictions and insertions as follows:
EE = {evict(a) : there are more evict(a) than insert(a) in the translation},
EI = {insert(a) : there are more insert(a) than evict(a) in the translation}.
Please note that, in this case, "there are more" means "there is one more" because there cannot be two evict(a) without an insert(a) between them or two insert(a) without an evict(a) between them.
PROOF. We modify the original evict/insert/access sequence translation by translation. Consider the current translation and let EI and EE be the set of effective insertions and evictions. We insert the missing nodes from the current translation path exactly at the moment when they are needed. Whenever this implies an insertion into a full cache, we perform one of the lowest effective evictions, where lowest means that no children of the node are in the TC. There must be such an effective eviction because, otherwise, the original sequence would overuse the cache as well. When all nodes of the current translation path are accessed, we schedule all remaining effective evictions and insertions at the beginning of the next translation; first the evictions in descendant-first order and then the insertions in ancestor-first order. The modified sequence is operationally equivalent to the original one, performs no more insertions, and does not exceed cache size. Moreover, the current translation is now lazy.
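The sets EE and EI defined above can be computed from the evict/insert operations of one translation by tracking, for each node, the difference between its insertions and evictions. The following sketch assumes the operations are given as a list of (kind, node) pairs; this representation and the function name `effective_ops` are hypothetical.

```python
from collections import Counter


def effective_ops(ops):
    """Return (EE, EI) for the operations of a single translation.

    `ops` is a list of ("evict", node) or ("insert", node) pairs.
    EE holds nodes with one more eviction than insertions,
    EI holds nodes with one more insertion than evictions.
    """
    balance = Counter()
    for kind, node in ops:
        balance[node] += 1 if kind == "insert" else -1
    ee = {node for node, b in balance.items() if b < 0}
    ei = {node for node, b in balance.items() if b > 0}
    return ee, ei


if __name__ == "__main__":
    ops = [("insert", "a"), ("evict", "a"),   # a cancels out
           ("evict", "b"),                    # b is effectively evicted
           ("insert", "c"), ("evict", "c"), ("insert", "c")]  # c is effectively inserted
    print(effective_ops(ops))  # ({'b'}, {'c'})
```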
ISLRU, or LRU with the Initial Segment Property
Even without ISP, LRU has the following property:
LEMMA 5.9. When the LRU policy is in use, the number of TC misses in a translation is equal to the layer number of the highest missing node on the translation path.
PROOF. The content of the LRU cache is easy to describe. Concatenate all translation paths and delete all occurrences of each node except the last. The last W nodes of the resulting sequence form the TC. Observe that an occurrence of a node is only deleted if the node is part of a later translation path. This implies that the TC contains at most two incomplete translation paths; namely, the least recent path that still has nodes in the TC and the current path. The former path is evicted top-down, and the latter path is inserted top-down. The claim now easily follows. Let v be the highest missing node on the current translation path. If no descendant of v is contained in the TC, the claim is obvious. Otherwise, the topmost descendant present in the TC is the first node on the part of the least recent path that is still in the TC. Thus, as the current translation path is loaded into the TC, the least recent path is evicted top-down. Consequently, the gap is never reduced.
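The description of the LRU cache content used at the start of the proof can be checked against a direct simulation. The sketch below is illustrative only; the function names are hypothetical, and the access sequence stands for the concatenation of all translation paths.

```python
from collections import OrderedDict


def lru_content_by_last_occurrence(accesses, W):
    """Delete all but the last occurrence of each node; keep the last W nodes."""
    seen = set()
    most_recent_first = []
    for node in reversed(accesses):
        if node not in seen:
            seen.add(node)
            most_recent_first.append(node)
    # Return the cache ordered from least to most recently used.
    return list(reversed(most_recent_first[:W]))


def lru_content_by_simulation(accesses, W):
    """Run LRU step by step and return the final cache content."""
    tc = OrderedDict()
    for node in accesses:
        if node in tc:
            tc.move_to_end(node)
        else:
            if len(tc) >= W:
                tc.popitem(last=False)  # evict the least recently used node
            tc[node] = True
    return list(tc)


if __name__ == "__main__":
    accesses = list("abcabdea")
    print(lru_content_by_last_occurrence(accesses, 3))  # ['d', 'e', 'a']
    print(lru_content_by_simulation(accesses, 3))       # ['d', 'e', 'a']
```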
The proof of Lemma 5.9 also shows that whenever LRU detaches nodes from the initial segment, the detached nodes will never be used again. This suggests a simple (implementable) way of introducing ISP to LRU. If LRU evicts a node that still has descendants in the TC, it also evicts the descendants. The descendants actually form a single path. Next, we use Lemma 5.8 to make this algorithm lazy again. It is easy to see that the resulting algorithm is the Initial Segment-Preserving LRU (ISLRU), as defined next.
Remark 5.12. In fact, the proposition holds also for $W \le d$, even though ISLRU no longer has ISP in this case.
ISMIN: The Optimal Strategy with the Initial Segment Property
Definition 5.13. ISMIN (Initial Segment Property Preserving MIN) is the replacement strategy for TCs with ISP that always evicts from a TC the node that is not used for the longest time into the future among the nodes that are not on the current translation path and have no descendants. Nodes that will never be used again are evicted before the others in arbitrary descendant-first order.
THEOREM 5.14. ISMIN is an optimal replacement strategy among those with ISP.
PROOF. Let R be any replacement strategy with ISP, and let t be the first point in time when it departs from ISMIN. We construct a strategy $R'$ with ISP that does not depart from ISMIN up to and including time t and has no more TC misses than R. Let v be the node evicted by ISMIN at time t.
We first assume that R evicts v at some later time $t'$ without accessing it in the interval $(t, t']$. Then, $R'$ simply evicts v at time t and shifts the other evictions in the interval $[t, t')$ to one later replacement. Postponing evictions to the next replacement does not cause additional insertions and does not break connectivity. It may destroy laziness by moving an eviction of a node directly before its insertion. In this case, $R'$ skips both. Since no descendant of v is in the TC at time t, and v will not be used for the longest time into the future, none of its children will be added by R before time $t'$; therefore, the change does not break the connectivity.
We come to the case that R stores v until it is accessed for the next time; say, at time $t'$. Let a be the node evicted by R at time t. $R'$ evicts v instead of a and remembers a as being special. We guarantee that the content of the TCs in the strategies R and $R'$ differs only by v and the current special node until time $t'$ and is identical afterward. To reach this goal, $R'$ replicates the behavior of R except for three situations.
(1) If R evicts the parent of the special node, $R'$ evicts the special node to preserve ISP and from then on remembers the parent as being special. As long as only Rule 1 is applied, the special node is an ancestor of a.
(2) If R replaces some node b with the current special node, $R'$ skips the replacement and from then on remembers b as the special node. Since a will be accessed before v, Rule 2 is guaranteed to be applied; hence, $R'$ is guaranteed to save at least one replacement.
(3) At time $t'$, $R'$ replaces the special node with v, performing one extra replacement.
We have shown how to turn an arbitrary replacement strategy with ISP into ISMIN without efficiency loss. This proves the optimality of ISMIN.
We can now state an ISP-aware extension of Lemma 5.2.
PROOF. MIN is an optimal replacement strategy, so it is at least as good as ISMIN. ISMIN is an optimal replacement strategy among those with ISP, so it is at least as good as ISLRU. ISLRU is at least as good as LRU by Proposition 5.11. $\mathrm{LRU}(W) \le 2\,\mathrm{MIN}(W/2)$ holds by Lemma 5.2.
PROOF. d nodes are sufficient for LRU to store one extra path; hence, from the construction, LRU on a larger cache always stores a superset of nodes stored by ISLRU. Therefore, it causes no more TC misses because it is lazy.
The proof of Theorem 5.17 is lengthy. We first derive some properties of Belady's optimal algorithm MIN(W) and then transform any MIN(W) strategy in two steps into an ISMIN(W + d) strategy.
Recall that Belady's algorithm MIN, also called the clairvoyant algorithm, is an optimal replacement policy. The algorithm always replaces the node that will not be accessed for the longest time into the future. An elegant optimality proof for this approach is provided in Michaud [2007]. MIN does not differentiate between nodes that will not be used again. Therefore, without loss of generality, let us from now on consider the descendant-first version of MIN. For any point in time, let us call all the nodes that are still to be accessed in the current translation the required nodes. The required nodes are exactly those nodes on the current translation path that are descendants of the last accessed node (or the whole path if the translation is only about to begin).
PROOF. Ad 1. If v will be accessed ever again, then w will be used earlier (in the same translation); so, MIN evicts v before w. If v will never be accessed again, then MIN evicts it before w because it is the descendant-first version.
Ad 2. Either the TC stores the whole current translation path and no eviction occurs, or there is a cell in the TC that contains a node off the current translation path; hence, the root is not evicted because it has a nonrequired descendant in the TC.
Ad 3. Either the TC stores the whole current translation path, or there is a cell c in the TC with content that will not be used before any required node; hence, no required node is the node that will not be needed for the longest time into the future.
PROOF. If MIN evicts a node on the current translation path, it cannot be a descendant of the just translated node (Lemma 5.18, claim 3); it also cannot be an ancestor of the just translated node (Lemma 5.18, claim 1). Hence, only the just translated node is admissible. If the algorithm evicts a node off the current translation path, it must have no descendants (Lemma 5.18, claim 1).
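Belady's rule, as recalled above, can be sketched on a plain access sequence. This is a generic illustration, not the descendant-first variant used in the proofs: nodes that are never used again are treated as equally good victims and ties are broken arbitrarily. The function name `min_faults` is hypothetical.

```python
def min_faults(accesses, W):
    """Count the misses of Belady's MIN on `accesses` with a cache of size W."""
    never = float("inf")
    # next_use[i] is the position of the next access to accesses[i] after i.
    next_use = [never] * len(accesses)
    upcoming = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = upcoming.get(accesses[i], never)
        upcoming[accesses[i]] = i

    cache = {}  # node -> position of its next use
    faults = 0
    for i, node in enumerate(accesses):
        if node not in cache:
            faults += 1
            if len(cache) >= W:
                # Evict the node whose next use lies farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[node] = next_use[i]
    return faults


if __name__ == "__main__":
    seq = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print(min_faults(seq, 3))  # 9
```

Computing the next-use positions in one backward pass keeps the sketch short; the eviction step is linear in W, which is adequate for illustration.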
LEMMA 5.21. If MIN has evicted the node that was just accessed, it will continue to do so for all the following evictions in the current translation. We refer to this as the round robin approach.
PROOF. If MIN has evicted a node w that was just accessed, it means that all the other nodes stored in the TC will be reused before the evicted node. Moreover, all subsequent nodes traversed after w in the current translation will be reused even later than w, if at all.
PROOF. We introduce a replacement strategy RRMIN. We add a special cell rr to the TC, and we refer to the remaining W cells as the regular TC. We show that the cell rr allows us to preserve ISP in the regular TC with no additional TC misses. We start with an empty TC, and we run MIN on a separate TC of size W on the side and observe its decisions.
We keep track of a partial bijection $\varphi_t$ on the nodes of the translation tree. We put one timestamp t on every TC access and one more between every two accesses in the regular phase of MIN. We position evictions and insertions between the timestamps, at most one of each between two consecutive accesses. At time t, $\varphi_t$ maps every node stored by MIN in its TC to a node stored by RRMIN in its regular TC. The function $\varphi_t$ always maps nodes to (not necessarily proper) ancestors in the memory translation tree. We denote this as $\varphi_t(a) \succeq a$, and, in the case of proper ancestors, as $\varphi_t(a) \succ a$. We say that a is a witness for $\varphi_t(a)$.
PROPOSITION 5.24. Since the partial bijection $\varphi_t$ always maps nodes to ancestors, for every path of the translation tree, RRMIN always stores at least as many nodes as MIN.
To prove Lemma 5.23, we need to show how to preserve the properties of the bijection $\varphi_t$ and ISP. In accordance with Corollary 5.22, MIN inserts a number of highest missing nodes in the regular phase and uses the round robin approach on the remaining ones.
Let us first consider the case when MIN has only the regular phase and inserts the complete path. In this case, we substitute evictions and insertions of MIN with these described next.
Let MIN evict a node a. If $\varphi_t(a)$ has no descendants, RRMIN evicts it. Otherwise, we find a node $\varphi_t(b)$ that is a descendant of $\varphi_t(a)$ and has no descendants of its own. RRMIN evicts $\varphi_t(b)$, and we set $\varphi_{t+1}(b) := \varphi_t(a)$. Clearly, we have preserved the properties of $\varphi_{t+1}$, and ISP holds.

Now let MIN insert a new node e. At this point, we know that both RRMIN and MIN store all ancestors of e. If RRMIN has not yet stored e, RRMIN inserts it, and we set $\varphi_{t+1}(e) := e$. If e is already stored, it has a witness $\varphi_t^{-1}(e)$ that is a proper descendant of e. We find a sequence $e,\ \varphi_t^{-1}(e),\ \varphi_t^{-1}(\varphi_t^{-1}(e)),\ \dots,\ g$ of successively deeper nodes that ends with a node g that RRMIN has not stored yet. Such g exists because $\varphi_t^{-1}$ is an injection on a finite set and is undefined for e. We set $\varphi_{t+1}(h) := h$ for all elements h of the sequence except g. RRMIN inserts the highest not-stored ancestor f of g, and we set $\varphi_{t+1}(g) := f$. Note that the inserted node f might not be a required node. Properties of $\varphi_t$ are preserved, and RRMIN did not disconnect the tree it stores. Also, RRMIN performed the same number of evictions and insertions as MIN. Note as well that for all nodes on the translation path, $\varphi_t$ is the identity. Finally, Proposition 5.24 guarantees that all accesses are safe to perform at the time they were scheduled.

Now let us consider the case when MIN has both regular and round robin phases. Assume that the regular phase ends with the visit of node v. At this point, MIN stores the (nonempty for W > d due to Corollary 5.19) initial segment $p_v$ of the current path ending in v; it does not contain v's child on the current path, and it contains some number (maybe zero) of required nodes. Starting with v's child, MIN uses the round robin strategy. Whenever it has to insert a required node, it evicts its parent. Let r and rr be the number of evictions in the regular and round robin phase, respectively. RRMIN also proceeds in two phases. In the first phase, RRMIN simulates the regular phase, as described earlier. RRMIN also performs r evictions in the first phase, and $\varphi_t$ is the identity on $p_v$ at the end of the first phase; this holds because $\varphi_t$ maps nodes to ancestors and because MIN contains $p_v$ in its entirety at the end of the regular phase. Let d be the number of nodes on the current path below v; MIN stores d − rr of them at the beginning of the round robin phase, which it does not have to insert, and it does not store rr of them, which it has to insert.

Since $\varphi_t$ is the identity on $p_v$ after phase 1 of the simulation and maps the d − rr required nodes stored by MIN to ancestors, RRMIN stores at least the next d − rr required nodes below v at the beginning of phase 2 of the simulation. In the round robin phase, RRMIN inserts the required nodes missing from the regular TC one after the other into rr, disregarding what MIN does. Whenever MIN replaces a node a with its child g, in case of W > d we fix $\varphi_t$ by setting $\varphi_{t+1}(g) := \varphi_t(a)$. By Proposition 5.24, RRMIN does no more evictions than MIN. Therefore, because it also preserves ISP in the regular TC, Lemma 5.23 holds.
LEMMA 5.25. There is a replacement strategy with ISP on a TC of size W + d that causes no more TC misses than a general optimal replacement strategy on a TC of size W.
PROOF. To prove the lemma, we show how to use d additional regular cells in the TC to provide the functionality of the special cell rr while preserving ISP in the whole TC. We run the RRMIN algorithm on the side on a separate TC of size W + 1, and we introduce another replacement strategy, which we call LIS, on a TC of size W + d. LIS starts with an empty TC where d cells are marked. LIS preserves the following invariants:
(1) The set of nodes stored in the unmarked cells by LIS is equal to the set of nodes stored in the regular TC by RRMIN.

Whenever LIS can replicate evictions/insertions of RRMIN without violating the invariants, it does. Otherwise, we consider the following cases:
(1) Let RRMIN in the regular phase evict a node a that has marked descendants in LIS. Then, LIS marks the cell containing a and unmarks and evicts one of the marked nodes with no descendants that is not the node stored in rr. Such a node exists because the only other case is that the marked cells contain all nodes of some path excluding the root, and the leaf is stored in rr. Therefore, a is the root, but the root is never evicted due to ISP.
(2) In the regular phase, RRMIN inserts a node c into an empty cell while LIS already stores c in a marked cell. In this case, LIS unmarks the cell with c and marks the empty cell.
(3) In the round robin phase, RRMIN replaces the content of the cell rr, and LIS (if needed) replaces the content of an arbitrary marked cell holding a node with no descendants that is not on the current translation path. Since the root is always in the TC and there are d marked cells, such a cell always exists. ISP is preserved because the parent of this node is already in the TC.
At this stage, if we drop notions of ϕ t and marked nodes, LIS becomes an eager replacement strategy on a standard TC. Therefore, we can use Lemma 5.8 to make it lazy. This concludes the proof of Lemma 5.25.
Since ISMIN is an optimal strategy with ISP, Theorem 5.17 follows from Lemma 5.25.
Remark 5.26. We believe that the requirement for d additional cache size is essentially optimal. Consider the scenario when we access memory cells uniformly at random. Informally speaking, MIN will tend to permanently store the first $\log_K W$ levels of the translation tree because they are frequently used and will use a single cell to traverse the lower levels. To preserve ISP, we need $d - \log_K W + 1$ additional cells for storing the current path. Thus, only little improvement seems to be possible.
Conjecture 5.27. The strategy of storing higher nodes (Lemma 5.23) and using extra d cells to not evict nodes from the current translation path (Lemma 5.25) can be used to add ISP to any replacement strategy without efficiency loss.
ANALYSIS OF ALGORITHMS
In this section, we analyze the translation cost of some algorithms as a function of the problem size n and memory requirement m. For all the algorithms analyzed, $m = \Theta(n)$.
In the RAM model, there is a crucial assumption that usually goes unspoken; namely, the size of a machine word is logarithmic in the number of memory cells used. If the words were shorter, one could not address the memory. If the words were longer, one could intelligently pack multiple values in one cell. This technique can be used to solve NP-complete problems in polynomial time. This effectively puts an upper bound on n; namely, $n < 2^{\text{word length}}$, whereas asymptotic notations make sense only when n can grow to infinity. However, this is not a limitation of the RAM model: It merely shows that to handle bigger inputs, one needs more powerful machines.
In the VAT model, there is also a set of assumptions on the model constants. The assumptions bound n by machine parameters in the same sense as in the RAM model. However, unlike in the RAM model, they can hardly go unspoken. We call them the asymptotic order relations between parameters. The assumptions we found necessary for the analysis to be meaningful are as follows:
(1) $1 \le \tau d \le P$; moving a single translation path to the TC costs more than a single instruction, but not more than size-of-a-page many instructions; that is, if at least one instruction is performed for each cell in a page, the cost of translating the index of the page can be amortized.
(2) $K \ge 2$ (i.e., the fanout of the translation tree is at least 2).
(3) $m/P \le K^d \le 2m/P$ (i.e., the translation tree suffices to translate all addresses but is not much larger). As a consequence, $\log(m/P) \le d \log K = dk \le \log(2m/P) = 1 + \log(m/P)$, and, hence, $\log_K(m/P) \le d \le \log_K(2m/P)$.
(4) $d \le W < m^{\theta}$, for $\theta \in (0, 1)$ (i.e., the translation cache can hold at least one translation path, but is significantly smaller than the main memory).
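As a quick sanity check, the order relations can be evaluated for concrete machine parameters. The following Python sketch is only an illustration: the function name `check_order_relations` is hypothetical, the fourth relation is read as "the TC holds at least one translation path (W at least d) and W is below m to the power θ", and the sample numbers in the usage note are assumptions rather than measurements.

```python
def check_order_relations(tau, d, P, K, m, W, theta):
    """Evaluate the asymptotic order relations of the VAT model for given parameters.

    tau: cost of a TC miss, d: depth of the translation tree, P: page size,
    K: fanout of the translation tree, m: memory requirement, W: TC size,
    theta: exponent in (0, 1) bounding W by m**theta (assumed reading).
    """
    return {
        "1 <= tau*d <= P": 1 <= tau * d <= P,
        "K >= 2": K >= 2,
        "m/P <= K^d <= 2m/P": m / P <= K ** d <= 2 * m / P,
        "d <= W < m^theta": d <= W < m ** theta,
    }
```

For example, `check_order_relations(tau=100, d=3, P=2**12, K=2**9, m=2**39, W=1024, theta=0.5)` reports all four relations as satisfied, whereas increasing d to 4 with the same m violates the third one.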
Sequential Access
We scan an array of size n (i.e., we need to translate addresses b, b + 1, . . . , b + n − 1 in this order, where b is the base address of the array). The translation path stays constant for P consecutive accesses; hence, at most $2 + n/P$ indices must be translated for a total cost of at most $\tau d(2 + n/P)$. By assumption (1), this is at most $\tau d(n/P + 2) \le n + 2P$.
The analysis can be sharpened significantly. We keep the current translation path in the cache; hence, the first translation incurs at most d faults. The translation path changes after every P-th access and hence changes at most a total of $\lceil n/P \rceil$ times. Of course, whenever the path changes, the last node changes. The next to last node changes after every K-th change of the last node and hence changes at most $\lceil n/(PK) \rceil$ times. In total, we incur at most

$$d + \sum_{0 \le i < d} \left\lceil \frac{n}{P K^i} \right\rceil \le 2d + \frac{2n}{P}$$

cache faults; of these faults, at most $1 + n/P$ are caused by data pages, and the remaining ones are caused by internal nodes of the translation tree. The cost is therefore bounded by $2\tau d + 2\tau n/P \le 2P + 2n/d$, which is asymptotically smaller than the RAM complexity.
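The counting argument for the sequential scan can be checked with a short Python sketch. It assumes, as an illustration only, that the node of a translation path at level i is identified by the page index divided by K to the power i (level 0 being the data page), and that the current path is always kept in the TC; `sequential_scan_faults` is a hypothetical helper name.

```python
def sequential_scan_faults(n, b, P, K, d):
    """Count path nodes that must be loaded while scanning addresses b .. b+n-1.

    Assumes the current translation path (levels 0 .. d-1) is always cached,
    so a fault occurs only when the node on some level changes.
    """
    faults = 0
    prev_path = None
    for addr in range(b, b + n):
        page = addr // P
        path = tuple(page // K**i for i in range(d))
        if prev_path is None:
            faults += d  # the first translation loads the whole path
        elif path != prev_path:
            faults += sum(1 for new, old in zip(path, prev_path) if new != old)
        prev_path = path
    return faults
```

Under these assumptions the returned count stays below 2d + 2n/P for any base address b, in line with the bound derived above.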
Random Access
In the worst case, no node of any translation path is in the cache. Thus, the total translation cost is bounded by $\tau d n$. This is at most $\tau n \log_K(2n/P)$.
We next argue a lower bound. We may assume that the TC satisfies the initial segment property. The translation path ends in a random leaf of the translation tree. For every leaf, some initial segment of the path ending in this leaf is cached. Let u be an uncached node of the translation tree of minimal depth, and let v be a cached node of maximal depth. If the depth of v is larger by two or more than the depth of u, then it is better to cache u instead of v (because more leaves use u instead of v). Thus, up to one, the same number of nodes is cached on every translation path; hence, the expected length of the path cached is at most $\log_K W$, and, hence, the expected number of faults during a translation is at least $d - \log_K W$. The total expected cost is therefore at least $\tau n(d - \log_K W) \ge \tau n \log_K(n/(PW))$, which is asymptotically larger than the RAM complexity.
LEMMA 6.1. The memory access cost of a random scan of an array of size n is at least $\tau n \log_K(n/(PW))$ and at most $\tau n \log_K(2n/P)$.
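The gap between sequential and random access can also be observed in a small simulation. The Python sketch below is an illustration under simplifying assumptions: it uses plain LRU (not ISLRU or MIN) on a TC of W cells, identifies a translation-tree node by the pair (level, page index divided by K to the power level), and walks each path from the top level down to the data page. The name `lru_tc_misses` is hypothetical.

```python
from collections import OrderedDict

def lru_tc_misses(addresses, P, K, d, W):
    """Count TC misses of plain LRU while translating `addresses` one by one."""
    tc = OrderedDict()  # cached nodes in LRU order (oldest first)
    misses = 0
    for addr in addresses:
        page = addr // P
        for level in range(d - 1, -1, -1):  # top level first, data page last
            node = (level, page // K**level)
            if node in tc:
                tc.move_to_end(node)        # refresh recency on a hit
            else:
                misses += 1
                tc[node] = True
                if len(tc) > W:
                    tc.popitem(last=False)  # evict the least recently used node
    return misses
```

Feeding the simulator `range(n)` versus n addresses drawn uniformly at random from the same range shows the sequential scan staying within 2d + 2n/P misses, while the random scan incurs close to one miss per access on the lowest level alone when the number of pages is much larger than W.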
Binary Search
We do n binary searches in an array of length n. Each search searches for a random element of the array. For simplicity, we assume that n is a power of two minus one. A binary search in an array is equivalent to a search in a balanced tree where the root is stored in location n/2, the children of the root are stored in locations n/4 and 3n/4, and so on. We cache the translation paths of the top $\ell$ layers of the search tree and the translation path of the current node of the search. The top $\ell$ layers contain $2^{\ell+1} - 1$ vertices; hence, we need to store at most $d2^{\ell+1}$ nodes of the translation tree. This is feasible if $d2^{\ell+1} \le W$. For the next two paragraphs, let $\ell = \log(W/2d)$. Each of the remaining $\log n - \ell$ steps of the binary search causes at most d cache faults. Therefore, the total cost per search is bounded by

$$\tau d(\log n - \ell) \le \tau \log_K(2n/P)\,\log(2nd/W).$$
This analysis may seem unrefined. After all, once the search leaves the top layers of the search tree, addresses of subsequent nodes differ only by $n/2^{\ell}, n/2^{\ell+1}, \dots, 1$. However, we next argue that this bound is essentially sharp for our caching strategy. In a second step, we extend the bound to all caching strategies. By Lemma 5.1, if two virtual addresses differ by D, their translation paths differ in the last $\log_K(D/P)$ nodes. Thus, the scheme incurs at least

$$\sum_{\ell \le i \le \log(n/P)} \log_K \frac{n}{2^i P} = \frac{1}{\log K} \sum_{0 \le j \le \log(n/P) - \ell} j \ge \frac{1}{2\log K} \log^2 \frac{2nd}{PW}$$

cache faults. We next show that it essentially holds true for any caching strategy.
By Theorem 5.15, we may assume that ISLRU is used as the cache replacement strategy (i.e., the TC contains top nodes of the recent translation paths). Let $\ell = \lceil \log(2W) \rceil$. There are $2^{\ell} \ge 2W$ vertices of depth $\ell$ in a binary search tree. Their addresses differ by at least $n/2^{\ell}$; hence, for any two such addresses, their translation paths differ in at least the last $z = \log_K(n/(2^{\ell} P))$ nodes. Call a node at depth $\ell$ expensive if none of the last z nodes of its translation path is contained in the TC and inexpensive otherwise. There can be at most W inexpensive nodes; hence, with probability at least 1/2, a random binary search goes through an expensive node, call it v, at depth $\ell$. Since ISLRU is the cache replacement strategy, the last z nodes of the translation path are missing for all descendants of v. Thus, by the argument in the preceding paragraph, the expected number of cache misses per search is at least

$$\frac{1}{2} \sum_{\ell \le i \le \log(n/P)} \log_K \frac{n}{2^i P} \ge \frac{1}{4 \log K} \log^2 \frac{n}{2^{\ell} P}.$$

The memory access cost of n random binary searches in an array of size n is at most $\tau \log_K(2n/P) \cdot n \log(2nd/W)$.

We know from cache-oblivious algorithms that the van Emde Boas layout of a search tree improves locality. We show in Section 7 that this improves the translation cost.
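The shrinking address gaps of a binary search, and their effect on translation paths, can be made concrete with a small Python sketch. The helper names `probe_addresses` and `path_changes` are hypothetical, array indices are used directly as addresses, and a path node at level i is again identified by the address divided by P times K to the power i; these are illustrative assumptions.

```python
def probe_addresses(n, target):
    """Indices probed by a binary search for `target` in the sorted array 0 .. n-1."""
    lo, hi = 0, n - 1
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2
        probes.append(mid)
        if mid == target:
            break
        if mid < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes

def path_changes(probes, P, K, d):
    """For each pair of consecutive probes, count the levels 0 .. d-1 whose node differs."""
    return [
        sum(1 for i in range(d) if prev // (P * K**i) != cur // (P * K**i))
        for prev, cur in zip(probes, probes[1:])
    ]
```

The early probes are far apart and change the translation path on many levels, while the late probes stay within a page; this is the behavior that the lower-bound argument above exploits.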
Heapify and Heapsort
We prove a bound on the translation cost of heapify. The following proposition generalizes the analysis of sequential scan. Moreover, there is a set of $x = n/(PK^{\ell})$ addresses such that the union of the paths has size at least $x(\ell + 1) + d - \ell$.
PROOF. The union of the translation paths to all n addresses contains at most n/P nonextremal nodes on the leaf level (= level 0) of the translation tree. On level i, $i \ge 0$, from the bottom, it contains at most $n/(PK^i)$ nonextremal nodes. We overestimate the size of the union of x translation paths by counting one node each on levels 0 to $\ell - 1$ for every translation path and all nonextremal nodes contained in all the n translation paths on the levels above. Thus, the size of the union is bounded by

$$x\ell + \sum_{\ell \le i \le d} \frac{n}{PK^i} \le x\ell + \frac{2n}{PK^{\ell}}.$$

A node on level $\ell$ lies on the translation path of $K^{\ell} P$ consecutive addresses. Consider addresses $z + iPK^{\ell}$ for $i = 0, 1, \dots, n/(PK^{\ell}) - 1$, where z is the smallest in our set of n addresses. The translation paths to these addresses are disjoint from level $\ell$ down to level zero and use at least one node on levels $\ell + 1$ to d. Thus, the size of the union is at least $x(\ell + 1) + d - \ell$.
An array A[1..n] storing elements from an ordered set is heap-ordered if $A[i] \le A[2i]$ and $A[i] \le A[2i + 1]$ for all i with $1 \le i \le \lfloor n/2 \rfloor$. An array can be turned into a heap by calling operation sift(i) for $i = \lfloor n/2 \rfloor$ down to 1. sift(i) repeatedly interchanges z = A[i] with the smaller of its two children until the heap property is restored. We use the following translation replacement strategy. Let $z = \min(\log n, (W - 2d - 1)/\log_K(n/P) - 1)$. We store the extremal translation paths (2d − 1 nodes), nonextremal parts of the translation paths for z addresses $a_0, \dots, a_{z-1}$, and one additional translation path $a_\infty$ ($\log_K(n/P)$ nodes for each). The additional translation path is only needed when $z < \log n$. During the siftdown of A[i], $a_0$ is equal to the address of A[i], $a_1$ is the address of one of the children of i (the one to which A[i] is moved, if it is moved), $a_2$ is the address of one of the grandchildren of i (the one to which A[i] is moved, if it is moved two levels down), and so on. The additional translation path $a_\infty$ is used for all addresses that are more than z levels below the level containing i.
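For reference, the heapify procedure whose memory accesses are analyzed here can be written down as a short Python sketch. It follows the description above (1-indexed array, sift(i) called for i from n/2 rounded down to 1, swapping with the smaller child); the function names and the unused slot `A[0]` are choices made for this illustration.

```python
def sift(A, i, n):
    """Sift A[i] down in the 1-indexed array A[1..n] until the heap property holds."""
    while 2 * i <= n:
        child = 2 * i
        if child + 1 <= n and A[child + 1] < A[child]:
            child += 1                  # pick the smaller of the two children
        if A[i] <= A[child]:
            break                       # heap property restored
        A[i], A[child] = A[child], A[i]
        i = child                       # continue one level further down

def heapify(A, n):
    """Turn A[1..n] into a heap by sifting down from position n // 2 to 1."""
    for i in range(n // 2, 0, -1):
        sift(A, i, n)
```

Each iteration of the loop in `sift` touches addresses i, 2i, and 2i + 1, which is the access pattern tracked by the addresses a_0, a_1, a_2, and so on in the caching strategy.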
Let us upper bound the number of TC misses. Preparing the extremal paths causes up to 2d + 1 misses. Next, consider the translation cost for $a_i$, $0 \le i \le z - 1$. $a_i$ assumes $n/2^i$ distinct values. Assuming that siblings in the heap always lie in the same page, the index (= the part of the address that is being translated) of each $a_i$ decreases over time; hence, Proposition 6.4 bounds the number of TC misses by the number of nonextremal nodes in the range. We use Proposition 6.5 to count them. For $i \in \{0, \dots, p\}$, we use the proposition with $x = n$ and $\ell = 0$ and obtain a bound of $2n/P = O(n/P)$ TC misses. For i with $p + (\ell - 1)k < i \le p + \ell k$, where $\ell \ge 1$ and $i \le z - 1$, we use the proposition with $x = n/2^i$ and obtain a bound of at most

$$\frac{n}{2^i}\,\ell + \frac{2n}{PK^{\ell}}$$

TC misses. There are $n/2^z$ siftdowns starting in layers z and above; they use $a_\infty$. For each such siftdown, we need to translate at most log n addresses, and each translation causes less than d misses. The total is less than $n(\log n)d/2^z$. Summation yields a total of

$$O\!\left(d + \frac{np}{P} + \frac{nd\log n}{2^z}\right)$$

TC misses.
For any realistic values of the parameters, the third term is insignificant; hence, the cost is $O(\tau(d + np/P))$. We next prove the corresponding lower bound under the additional assumption that $W < \frac{1}{2} n/P$. At least one address must be completely translated; hence, the cost is $\Omega(\tau d)$.
The addresses in $a_0, \dots, a_{p-1}$ assume at least one address per page in the subarray [n/2..n] because $a_i$ can never jump by more than $2^{i+1}$. First, the addresses are swept by $a_0$, then by $a_1$, and so on, and no other accesses to the subarray occur in the meantime. Hence, if the LRU strategy is in use and $W < \frac{1}{2} n/P$, there are at least $pn/(2P)$ TC misses to the lowest level of the translation tree. This gives the $\Omega(np/P)$ part of the lower bound. Hence, the total cost is $\Theta(\tau(d + np/P))$.

In the sorting phase of heapsort, we repeatedly remove the element stored in the root, move the element in the rightmost leaf to the root, and then let this element sift down to restore the heap property. The siftdown starts in the root and, after accessing address i of the heap, moves to address 2i or 2i + 1. For the analysis, we make the additional assumption W = M; that is, the data cache and the TC have the same size. We store the top $\ell$ layers of the heap in the data cache and the translation paths to the vertices of these layers in the TC, where $2^{\ell+1} < M$; say $\ell = \log(M/4) = \log(W/4)$. Each of the remaining $\log n - \ell$ siftdown steps may cause d cache misses. The total number of cache faults is therefore bounded by $nd(\log n - \ell) \le n \log_K(2n/P) \log(4n/W)$.
We leave the lower bound as an open problem.
CACHE-OBLIVIOUS ALGORITHMS
Algorithms for the EM model are allowed to use the parameters of the memory hierarchy in the program code. For any two adjacent levels of the hierarchy, there are two parameters: the size M of the faster memory and the size B of the blocks in which data are transferred between the faster and the slower memory. Cache-oblivious algorithms are formulated without reference to these parameters; that is, they are formulated as RAM algorithms. Only the analysis makes use of the parameters. A transfer of a block of memory is called an IO operation. For a cache-oblivious algorithm, let C(M, B, n) be the number of IO operations on an input of size n, where M is the size of the faster memory (also called cache memory) and B is the block size. Of course, $B \le M$.
Good cache-oblivious algorithms exhibit good locality of reference at all scales, and therefore, one may hope that they also show good behavior in the VAT model. The following theorem gives an upper bound on the VAT complexity in terms of the EM complexity of an algorithm.

THEOREM 7.1. Consider a cache-oblivious algorithm with IO complexity C(M, B, n)

PROOF. We divide the cache into d parts of size a and reserve one part for each level of the translation tree.
Consider any level i, where the leaves of the translation tree (= data pages) are on level 0. Each node on level i stands for $K^i P$ addresses, and we can store a nodes. Thus, the number of faults on level i in the translation process is the same as the number of faults of the algorithm on blocks of size $K^i P$ and a memory of a blocks (i.e., size $aK^i P$). Therefore, the number of cache faults is at most

$$\sum_{0 \le i < d} C(aK^i P,\ K^i P,\ n).$$
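Theorem 7.1 allows us to rederive some of the results from Section 6. For example, a linear scan of an array of length n has IO complexity at most $2 + n/B$. Thus, the number of cache faults in the VAT model is at most

$$\sum_{0 \le i < d} \left(2 + \frac{n}{K^i P}\right) \le 2d + \frac{2n}{P}.$$

It also allows us to derive new results. Quicksort has IO complexity $O((n/B) \log(n/B))$; hence, the number of cache faults in the VAT model is at most

$$\sum_{0 \le i < d} O\!\left(\frac{n}{K^i P} \log \frac{n}{K^i P}\right) = O\!\left(\frac{n}{P} \log \frac{n}{P}\right).$$

Binary search in the van Emde Boas layout has IO complexity $\log_B n$; hence, the number of cache faults in the VAT model is at most

$$\sum_{0 \le i < d} \frac{\log n}{\log(K^i P)} \le \frac{\log n}{\log P} + \frac{\log n}{\log K} \sum_{1 \le i < d} \frac{1}{i} \le \log_P n + \log_K n\,(1 + \ln d) \le \log_P n + \log_K n \left(1 + \ln \log_K \frac{4n}{P}\right),$$

where the last inequality follows from our assumption that $K^d P$ is at most twice the memory footprint of an algorithm and that the memory footprint of a binary tree with n leaves is bounded by 2n.
A matrix multiplication with a recursive layout of matrices has IO complexity $n^3/(M^{1/2} B)$; hence, the number of cache faults in the VAT model is at most

$$\sum_{0 \le i < d} \frac{n^3}{(aK^i P)^{1/2} K^i P} = \frac{n^3}{a^{1/2} P^{3/2}} \sum_{0 \le i < d} K^{-3i/2} = O\!\left(\frac{n^3}{a^{1/2} P^{3/2}}\right).$$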
Cache-oblivious algorithms that match the performance of the best EM algorithm for the problem are known for several fundamental algorithmic problems (e.g., sorting, FFT, matrix multiply, and searching [Frigo et al. 2012]). Do all these algorithms automatically have small VAT complexity via Theorem 7.1? Unfortunately, the answer is no. Observe that the theorem refers to the cache misses in a machine with memory size $aK^i P$ and block size $K^i P$ (i.e., memory consists of a blocks). However, many of the good cache-oblivious algorithms require a tall-cache assumption $M \ge B^2$; sometimes, the assumption $M \ge B^{1+\epsilon}$ for some positive $\epsilon$ suffices. For such algorithms, the theorem does not give good bounds.
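The way Theorem 7.1 is applied can be sketched in a few lines of Python. The sketch assumes that the bound is the sum, over the d levels i = 0, ..., d − 1 of the translation tree, of the IO complexity evaluated at memory size a·K^i·P and block size K^i·P, as the proof suggests; the function names `vat_fault_bound` and `scan_io` are hypothetical.

```python
def vat_fault_bound(C, n, a, K, P, d):
    """Sum the IO complexity C(M, B, n) over the d levels of the translation tree.

    Level i is treated as a cache of `a` blocks of size K**i * P (assumed reading
    of the proof of Theorem 7.1).
    """
    return sum(C(a * K**i * P, K**i * P, n) for i in range(d))

def scan_io(M, B, n):
    """IO complexity of a linear scan: at most 2 + n/B block transfers."""
    return 2 + n / B
```

For the linear scan, `vat_fault_bound(scan_io, n, a, K, P, d)` never exceeds 2d + 2n/P, which matches the bound obtained directly in Section 6. For an algorithm that needs a tall cache, the same sum is evaluated at a memory of only `a` blocks per level, which is why the theorem gives weak bounds there.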
In joint work with Pat Nicholson [Jurkiewicz et al. 2014], we have recently shown that cache-oblivious algorithms requiring a tall-cache assumption also perform well in the VAT model provided a somewhat more stringent tall-cache assumption holds. More precisely, consider a cache-oblivious algorithm that incurs C(M, B, n) cache faults when run on a machine with cache size M and block size B, provided that $M \ge g(B)$. Here, $g : \mathbb{N} \to \mathbb{N}$ is a function that captures the "tallness" requirement on the cache. We consider the execution of the algorithm on a VAT machine with cache size $\tilde{M}$ and page size P and show that the number of cache faults is bounded by $4dC(M/4, dB, n)$ provided that $M \ge 4g(dB)$. Here, $M = \tilde{M}/a$, $B = P/a$, and $a \ge 1$ is the size (in addressable units) of the items handled by the algorithm.
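Funnel sort [Frigo et al. 2012] is an optimal cache-oblivious sorting algorithm. On an EM machine with cache size M and block size B, it sorts n items with

$$O\!\left(\frac{n}{B} \log_{M/B} \frac{n}{B}\right)$$

cache faults, provided that $M \ge B^2$. As a consequence of our main theorem, we obtain a bound of

$$O\!\left(\frac{4n}{B} \log_{M/(4dB)} \frac{n}{dB}\right)$$

cache faults in the VAT model, provided that $M \ge 4(dB)^2$.

Since $M/(4dB) \ge (M/B)^{1/2}$ for realistic values of M, B, K, and n, this implies that funnel sort is essentially optimal also in the VAT model.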
DISCUSSION
In this section, we discuss additional topics that extend the scope of our research. In particular, we address the comments that we received from the ALENEX13 program committee and other researchers. 
Double Address Translation on Virtual Machines
Nowadays, more and more computation is performed on virtual machines in the cloud. In this environment, address translation must be performed twice, first to the virtual machine addressing space and then to the host. The cost of address translation to the host can be as high as $O(\tau \log(\text{size of virtual machine}))$. Moreover, big enough virtual machines may require translation for memory tables in the virtual machine, not just for the data. This is independent of the problem input size and significant in the case of random access, but still negligible in the case of sequential access. To test the impact of double address translation, we timed permutation and introsort on a virtual machine; results are provided in Figure 6.
Please note that STL introsort actually takes less time than the permutation generator, even for very small data. This is very surprising at first but means that a high VAT cost is especially harmful for programs launched on virtual machines. Since many cloud systems are meant primarily for computing, the discussed phenomenon should be of primary concern for such environments.
The Model Is Too Complicated
While we received comments that the model is too simple, we also received ones saying that the model is too complicated. This impression is probably due to the fact that some of our proofs are somewhat technical. Some arguments simplify if asymptotic notation is used earlier or if the VAT cost is upper bounded by the RAM cost ahead of time (for sequential access patterns to the memory), or the other way around for randomized access. However, as this is the first work addressing the subject, we find it appropriate to be more detailed than absolutely necessary. With time, more and more simplifications will appear. Let us briefly discuss a few candidates.
Value of K.
There is evidence that, for many algorithms, the exact value of K does not matter, and, hence, K = 2 may be used. In some cases, such as repeated binary search, the exact value of K seems to have only a little impact both in theory and practice. In other cases, such as permutation, it seems to be the cause of bumps on the chart in Figure 1, but the impact is moderate. A notable exception is matrix transpose and matrix multiplication, where the value of K is blatantly visible. The classic matrix transpose algorithm uses O(n) operations, where n is the input size. However, if the matrix is stored and read row by row, the output matrix must be written one element per row. For a square matrix, this means a jump of $\sqrt{n}$ cells between writes, which means $\sqrt{n}$ translations of cost $\Theta(\tau d)$ to produce the first column. Because there are $\sqrt{n}$ translations before another element is written to the same row, no translation path can be reused if we consider the LRU algorithm. Therefore, the total VAT cost is $\Theta(\tau n d)$, which is $\Theta(\tau n \log n)$. Figure 7 shows that even though the asymptotic growth is intact, the translation cost grows in jumps rather than in a smooth logarithmic fashion. The distance between the jumps appears to be directly related to the value of K; namely, the jump occurs when the matrix dimension is K times greater than during the previous jump. Note that the EM cost of this algorithm is $\Theta(n)$ for $\sqrt{n} \cdot B > M$, and $\Theta(n/B)$ for $\sqrt{n} \cdot B < M$. In fact, the first cost jump is due to this barrier itself.
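The stride pattern of the classic transpose can be illustrated with a short Python sketch. It assumes a square matrix of side length `side` (so n = side²) stored row by row with one cell per addressable unit, and lists the pages written while the first input row is read; the helper name `transpose_write_pages` is hypothetical.

```python
def transpose_write_pages(side, P):
    """Pages written, in order, while the first row of a row-major side x side
    matrix is read and transposed into a row-major output matrix.

    Reading input[0][j] writes output[j][0], which lives at address j * side.
    """
    return [(j * side) // P for j in range(side)]
```

As soon as `side` is at least the page size P, every write lands on a different page, so each of the `side` writes needs its own translation; for small matrices all writes share a page and the effect disappears.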
CAT or Sequence of Consecutive Address Translations. In our analysis, for many algorithms, precisely calculated VAT complexity was much smaller than the RAM complexity. We believe that our approach can bring valuable insight for future research, but some of our results can be obtained in a simpler way. The memory access patterns in the algorithms in question share some common characteristics. There are not too few elements, they are not overspread in the memory, and the accesses are more or less performed in a sequence. We formalize these properties in the following definition.

PROOF. We assume the LRU replacement strategy. First, let us assess the cost of translating addresses for all the $O(\ell/\tau)$ pages in increasing order. The first translation causes d TC misses. Since we allow only a constant number of operations between accesses from the considered sequence, the LRU replacement strategy holds the translation path of the last translation when the next one starts. Hence, the addresses to be translated change as in a classic K-ary counter. The amortized cost of an update of a K-ary counter is O(1). Since on average $\Omega(\tau)$ elements are accessed per page, the access range is at most of length $O(\ell/\tau)$, and so the cost of updates is $O(\ell)$. However, we do not start counting from zero, and the potential function in the K-ary counter analysis can reach up to log (the highest number seen), which in our case can reach d. Hence, we need to add the cost of another d TC misses to our estimation. The cost of all translations is therefore equal to $O(\ell + \tau d)$.
In the definition of a CAT, we do not assume that every page is used exactly once. However, neither skipping values in the counter nor reusing them causes extra TC misses.
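The amortized O(1) cost of a counter update, on which the proof relies, can be checked with a small Python sketch. It counts digit changes of a K-ary counter that starts at zero; the function name `counter_digit_changes` is hypothetical, and the start at zero is a simplification (the proof above accounts for an arbitrary start with the extra d misses).

```python
def counter_digit_changes(K, increments, digits):
    """Count digit changes while incrementing a K-ary counter `increments` times.

    The counter starts at zero and has `digits` digits, least significant first.
    """
    counter = [0] * digits
    changes = 0
    for _ in range(increments):
        i = 0
        while i < digits:
            changes += 1
            counter[i] += 1
            if counter[i] < K:
                break
            counter[i] = 0  # overflow: reset this digit and carry into the next
            i += 1
    return changes
```

Digit i changes once per K^i increments, so the total number of changes is at most `increments * K / (K - 1)`, that is, at most two changes per increment for any K of at least 2. In the translation setting, a digit change corresponds to a node of the translation path being replaced.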
Since the RAM cost is exactly $\Theta(\ell)$, it dominates the translation cost.

- There is a memory partitioning such that each part consists of all memory cells with some common virtual address prefix, and parts are of size at least $Pm^{\theta}$ for $\theta \in (0, 1)$.
- For at least a constant fraction of the accesses with at least a constant probability, each access is to a part that was not accessed since W TC misses.

LEMMA 8.4. The cost of a RAT of length $\ell$ is $\Theta(\ell \tau d)$. It is the same as the cost of the address translations.
PROOF. We assume the LRU replacement strategy. Since parts are of size at least $Pm^{\theta}$ for $\theta \in (0, 1)$, a translation of an address from each part uses $\Theta(d)$ translation nodes unique to its translation subtree. Therefore, an access to a part that was not accessed since W TC misses, misses the root of the subtree, and, by Lemma 5.9, the access causes $\Theta(d)$ misses. Because this happens for at least a constant fraction of the accesses with at least a constant probability, the total cost is $\Theta(\ell \tau d)$. The RAM cost is only $\Theta(\ell)$, which is less than the VAT cost by order assumption 1.
Larger Page Sizes
The straightforward method to determine how the VAT affects the running time of programs would be to switch it off and compare the results. Unfortunately, no modern operating system provides such an option. One can approximate the elimination of address translation by increasing the page size. If all the data fit into a single page, address translation is essentially eliminated. If all the data fit into a small number of pages, the number of translations and their cost is reduced. We performed experiments with larger page sizes. However, whereas hardware architectures support pages sized in gigabytes, operating systems do not. Quoting Hennessy and Patterson [2007] :
Relying on the operating systems to change the page size over time. The Alpha architects had an elaborate plan to grow the architecture over time by growing its page size, even building it into the size of its virtual address. When it came time to grow page sizes with later Alphas, the operating system designers balked and the virtual memory system was revised to grow the address space while maintaining the 8 KB page. Architects of other computers noticed very high TLB miss rates, and so added multiple, larger page sizes to the TLB. The hope was that operating systems programmers would allocate an object to the largest page that made sense, thereby preserving TLB entries. After a decade of trying, most operating systems use these superpages only for handpicked functions: mapping the display memory or other I/O devices, or using very large pages for the database code.
There are good reasons why operating system designers are reluctant to offer larger pages. The main concern is space. Pages must be correctly aligned in memory, so bigger pages lead to a greater waste of memory and limited flexibility while paging to disk. Another problem is that because most processes are small, using bigger pages would lengthen their initialization time. Therefore, current operating system kernels provide only basic, nontransparent support for bigger pages. The hugetlbpage feature of current Linux kernels allows one to use pages of size 2 MiB on AMD64 machines. The following links describe the hugetlbpage feature:
- http://linuxgazette.net/155/krishnakumar.html
- https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
- https://www.kernel.org/doc/Documentation/vm/hugepage-shm.c
- http://man7.org/linux/man-pages/man2/shmget.2.html
The feature attaches a final real address one level higher in the memory table (i.e., the last layer of nodes is eliminated from the translation trees, and pages are now of size $2^{9+12}$ bytes, that is, 2 MiB). This slightly decreases cache usage, decreases the number of nodes needed in each single translation by one, and, finally, increases the range of addresses covered by the related entry in the TLB by a factor of 512.
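The effect of dropping the last tree level can be illustrated with a small sketch. It assumes an AMD64-style layout with 12 offset bits and 9 index bits per tree level and a four-level tree; the function names `split_address` and `page_size` are hypothetical and chosen for illustration only.

```python
# Assumed AMD64-style layout: 12 offset bits, 9 index bits per tree level.
OFFSET_BITS = 12
INDEX_BITS = 9
LEVELS = 4  # height of the translation tree with 4 KiB pages


def split_address(vaddr, huge=False):
    """Return (indices, offset) for a virtual address, root index first.

    With huge=True the last tree level is dropped and its 9 index bits
    become part of the page offset, giving pages of 2**(9+12) bytes.
    """
    levels = LEVELS - 1 if huge else LEVELS
    offset_bits = OFFSET_BITS + (INDEX_BITS if huge else 0)
    offset = vaddr & ((1 << offset_bits) - 1)
    rest = vaddr >> offset_bits
    indices = []
    for _ in range(levels):
        indices.append(rest & ((1 << INDEX_BITS) - 1))
        rest >>= INDEX_BITS
    indices.reverse()  # collected leaf-first, so flip to root-first
    return indices, offset


def page_size(huge=False):
    """Page size in bytes for the assumed layout."""
    return 1 << (OFFSET_BITS + (INDEX_BITS if huge else 0))
```

Under these assumptions, `page_size(True) // page_size(False)` equals 512, which is the factor by which the address range covered by one TLB entry grows, and a translation with big pages visits three tree nodes instead of four.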
We reran permute, introsort, and binsearch on the same machine, with and without big pages. Figure 8 clearly shows that the use of bigger pages can yield a speedup. In other words, the cost of virtual address translation can be partially reduced by using bigger pages.
The Translation Tree Is Shallow
It is true that the height of the translation tree on today's machines is bounded by 4, and so the translation cost is bounded. However, even though our experiments use only three levels, the slowdown appears to be at least as significant in practice as the one caused by a factor of log n in the operational complexity. Therefore, decreasing VAT complexity is of considerable practical significance. Note also that although 64-bit addresses are sufficient to address any memory that can ever be constructed according to known physics, there are other practical reasons to consider longer addresses. Therefore, the current bound for the height of the translation tree is not absolute.
What About Hashing?
We have been asked whether the current VAT system could be replaced with one based on hash tables to achieve a constant amortized translation time. We argue that it is not a good idea. First and foremost, hash tables sometimes need rehashing, and this would mean a complete blockage of the operating system. Moreover, an adversary can try to increase the number of necessary rehashes. Note that the probabilistic guarantees concern the frequency of rehashes, and program isolation is insufficient to dismiss this concern because an attack can be performed through side channels such as differential power analysis [Tiri 2007]. Finally, a tree walk is simple enough to be supported by hardware to obtain significant speedups; in the case of hashing, this would not be so easy.
On the other hand, simple hash tables can be used to implement efficient caches. In fact, associative memory can be seen as a hardware implementation of a hash table. If we no longer require the associative memory to store all previous entries reliably, then associative memories of small enough sizes can be well supported by hardware. This is, in fact, how the TLB is implemented and one of the reasons why it is so small.
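The idea of a hash table that is allowed to forget entries can be sketched as follows. This is a deliberately simplified, direct-mapped toy and not a description of any actual TLB design; the class `DirectMappedTLB`, the function `identity_walk`, and the modulo hash are all assumptions made for illustration.

```python
class DirectMappedTLB:
    """Toy TLB: a fixed-size table indexed by a hash of the page number.

    A colliding entry simply overwrites the old one, so the table never
    needs rehashing, but it may forget earlier translations.
    """

    def __init__(self, entries, page_bits):
        self.entries = entries
        self.page_bits = page_bits
        self.slots = [None] * entries  # each slot holds (virtual page, frame)
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, walk):
        vpage = vaddr >> self.page_bits
        offset = vaddr & ((1 << self.page_bits) - 1)
        slot = vpage % self.entries  # trivial hash function (assumption)
        cached = self.slots[slot]
        if cached is not None and cached[0] == vpage:
            self.hits += 1
            frame = cached[1]
        else:
            self.misses += 1
            frame = walk(vpage)  # fall back to the translation tree walk
            self.slots[slot] = (vpage, frame)  # evict whatever was there
        return (frame << self.page_bits) | offset


def identity_walk(vpage):
    """Hypothetical stand-in for the tree walk: maps each page to itself."""
    return vpage
```

Because a collision overwrites the slot instead of growing the table, every lookup touches exactly one slot, and the full translation tree remains the authoritative fallback on a miss.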
CONCLUSION
We introduced the VAT model and analyzed some fundamental algorithms in this model. We have shown that the predictions made by the model agree well with measured running times. Our work is just the beginning. In follow-up work with Patrick Nicholson [Jurkiewicz et al. 2014], we show that all cache-oblivious algorithms perform well in the VAT model, provided a tall cache assumption holds that is somewhat more stringent than the one for the EM model. It would be interesting to know whether this more stringent assumption is necessary.
