Guarded Page Tables implement huge sparsely occupied address spaces efficiently and have the advantages of multi-level tables (tree structure, hierarchy, sharing). We present an implementation of guarded page tables on the R4600 processor. The paper describes both the architecture-dependent design process of the algorithms and the resulting tool box.
Rationale
This work originated as part of the Mungi [2] project at UNSW, which aims to build an object-oriented single address space operating system. Since Mungi makes heavy use of a sparsely occupied address space, the VM system must support sparsity efficiently. We selected the Guarded Page Table mechanism (see section 2), which combines the advantages of multi-level and inverted page tables.
The critical point was whether the GPT mechanism could be implemented efficiently on the R4600 processor. Therefore, we developed R4600-specific GPT parsing algorithms (section 3) and complemented them with a second-level software TLB (section 5). How best to combine the elements depends on both the concrete memory system (cache and memory timing) and the TLB-miss characteristics of the OS and applications. Therefore, we include a detailed performance discussion and make the software available as a tool box. Independent of the concrete problem, section 3 can serve as an example of architecture-dependent micro optimization. An interesting result is that about 2/3 of the optimization process, though architecture-dependent, can be carried out in terms of a high-level language and is based on algorithmic and data structure optimizations. The example shows that substantial performance gains (factors of 2.5 or more) are achievable by combining this method with specific assembler-level optimizations where general automatic code optimization techniques do not help.
Guarded Page Tables
Guarded Page Tables have been described in [6, 7]. They combine the advantages of tree-structured multi-level page tables and hashed page tables: unlimited sparsity (2 page table entries per mapped page are always sufficient), tree structure (subtree sharing, hierarchical operations) and multiple page sizes. These properties are described in more detail in [5, 8]. Here we give only a short sketch of the basic mechanism.
The main problem with multi-level page tables is sparsity: we need huge numbers of page table entries for non-mapped pages. Look at the following example, where the mapping of page 11 10 11 00 in a sparsely occupied address space is shown. (For demonstration purposes we use very small addresses and small page tables. Nil pointers are marked by ".".) The second- and third-level page tables are extremely sparse: each contains a single non-nil entry. Consequently, there is only one valid path through these two tables: when the leftmost two bits are "11", the subsequent address bits must be "10 11"; all other addresses lead to page faults.
As shown in figure 1, we can omit these two page tables and skip the associated translation steps. Whenever entry 3 of the top-level page table is reached, we have to check whether "10 11" is a prefix of the remaining address. If so, this prefix can be stripped off, and the translation process can directly continue at the level-4 page table. Therefore, each entry is augmented with a bit string g of variable length, which is referred to as a guard. This is the key idea of guarded page tables.
The translation process works as follows: first, a page table entry is selected by the highest part of the virtual address upon each transformation step in the same way as in the conventional multi-level page table method. The selected entry however contains not only a pointer (and perhaps an access attribute) but also the guard g. If g is a prefix of the remaining virtual address, the translation process either continues with the remaining postfix or terminates with the postfix as page offset. As an example, figure 2 presents the transformation of 20 address bits by 3 page tables. Note that the length of the guards may vary from entry to entry. Furthermore, page table sizes can be mixed; all powers of 2 are admissible. The same holds for data pages, i.e., 
3

GPT Parser
At first, we describe a GPT translation step in general, independent of concrete hardware (see figure 3). Here, v is the part of the original virtual address that is still subject to translation, and the pair (p, s) determines the page table (p: physical address, s: log2 of table size) that has to be used for the current translation step. The result of this step is either a new page table (p', s') and a postfix v' of v, or the data page (p', s') and offset v'. The translation step starts by extracting u, the uppermost s bits of v; u is used for indexing the page table. The addressed entry specifies a guard g of variable size (possibly empty), which is checked against w, the bits of the virtual address that follow u (w = g). When they are equal, the remaining v' is used either for the next-level translation or as the offset part. This operates as a shortcut, since not only u but both u and w are stripped off the virtual address in one step; no table is necessary to decode w.
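The general translation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` and `Table` classes and the `translate` function are hypothetical names that do not come from the paper, addresses are modelled as an integer plus an explicit bit length, and page faults are modelled by returning `None`. The example tables reproduce the page 11 10 11 00 from section 2, followed by an assumed 4-bit offset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Entry:
    guard: int                     # guard bits g
    guard_len: int                 # |g|, may be 0 (empty guard)
    target: Union["Table", int]    # next-level table, or frame number for a leaf
    is_leaf: bool = False


@dataclass
class Table:
    log2_size: int                   # s: the table has 2**s entries
    entries: List[Optional[Entry]]   # None marks a nil entry


def translate(table: Table, v: int, v_len: int) -> Optional[Tuple[int, int, int]]:
    """Walk the tree; return (frame, offset, offset_len) or None on a page fault."""
    while True:
        s = table.log2_size
        if v_len < s:
            return None
        u = v >> (v_len - s)                 # uppermost s bits index the table
        v_len -= s
        v &= (1 << v_len) - 1                # strip u
        entry = table.entries[u]
        if entry is None or v_len < entry.guard_len:
            return None
        w = v >> (v_len - entry.guard_len)   # the |g| bits that follow u
        if w != entry.guard:                 # guard mismatch: page fault
            return None
        v_len -= entry.guard_len
        v &= (1 << v_len) - 1                # strip w as well
        if entry.is_leaf:
            return entry.target, v, v_len
        table = entry.target


# Page 11 10 11 00: entry 3 of the top table carries the guard "10 11"
# and points directly at the level-4 table (frame number 7 is made up).
leaf_table = Table(2, [Entry(0, 0, 7, is_leaf=True), None, None, None])
top = Table(2, [None, None, None, Entry(0b1011, 4, leaf_table)])
```

With these tables, the address 11 10 11 00 0101 translates in two steps instead of four, while any address starting with 11 whose next four bits differ from 10 11 faults at the top-level entry.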
Note that the width of u (determined by the page table size) may vary from step to step, and that the size of w may differ from entry to entry. Step requires only 14 arithmetic/load operations and no longer needs the variable vl~.
The next optimization is based on the idea of adjusting the guard bits in the GPT entry and extending the guard by the number u of this entry. The resulting step requires only 10 arithmetic/load operations and avoids the per-entry field gmask.

Up to this point, we have looked at only one translation step. For a complete translation, a loop is required. To approximate an until-loop, we first move the then-part statements before the if statement. This is possible because these three statements do not destroy data that is still required:

The loop terminates when a page fault, i.e. a guard mismatch, is detected. Of course, the translation process must also terminate in the positive case, i.e. if the translation finishes without a page fault. Adding a further termination condition to the loop would increase our costs per translation step.
A better solution is to introduce a pseudo mismatch at leaf page table entries. We need an extended guard G which includes the matching guard g but which in all cases leads to a mismatch, i.e. (v XOR G) >> s1 ≠ 0. Now recall that the extended guard of the u-th entry of a page table always contains the index u. Therefore, we can achieve a pseudo mismatch by using an "incorrect" index ũ for building the extended guard. G = ((ũ << |g|) + g) << s1 with ũ ≠ u always leads to a mismatch:
The loop terminates either due to detecting a page fault or a leaf entry. In the case of
we have a pseudo mismatch, i.e. a successful translation. For the mentioned check, we need a field holding the value 64 - |g|. In leaf entries, the s0-field is free and can be used for this purpose.
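The pseudo-mismatch idea can be tried out in a short Python sketch. It is an illustration under assumptions, not the routine from the paper: addresses are taken to be left-aligned in a 64-bit word, s1 is taken to be 64 - s - |g|, and the leaf check in `guard_matches` (shift out the index, then shift right by 64 - |g|) is one plausible reading of the text. All function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def extended_guard(u: int, s: int, g: int, g_len: int):
    """Return (G, s1): index and guard left-aligned in a 64-bit word."""
    s1 = 64 - s - g_len
    G = (((u << g_len) | g) << s1) & MASK64
    return G, s1


def mismatch(v: int, G: int, s1: int) -> bool:
    """Loop exit test from the text: (v XOR G) >> s1 != 0."""
    return ((v ^ G) & MASK64) >> s1 != 0


def guard_matches(v: int, G: int, s: int, g_len: int) -> bool:
    """Assumed leaf check: compare only the guard bits, ignoring the index."""
    return (((v ^ G) << s) & MASK64) >> (64 - g_len) == 0


# Entry u = 3 of a 4-entry table (s = 2) with guard 1011.
v = (0b111011 << 58) | 0x123                     # index 11, guard bits 1011
G_inner, s1 = extended_guard(3, 2, 0b1011, 4)    # built with the correct index
G_leaf, _ = extended_guard(3 ^ 1, 2, 0b1011, 4)  # built with an "incorrect" index
```

For an inner entry the loop continues because `mismatch(v, G_inner, s1)` is false. For the leaf entry `mismatch(v, G_leaf, s1)` is always true, and `guard_matches` then separates a successful translation from a real page fault.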
R4600 Implementation
Before presenting a concrete implementation of GPT parsing, a brief R4600 introduction is necessary. The R4600 is a member of the MIPS R4000 family of processors, which feature 64-bit integer and floating point operations. They have thirty-two general purpose 64-bit registers, of which two are special. Register r0 ignores writes and always returns zero when read. Register r31 is used to store the return address of Jump And Link (JAL) instructions.
The R4600 has a primary 16KB instruction cache and a 16KB data cache on chip. Both caches are two-way set associative, use a 32 byte line size, and FIFO replacement within a set. Secondary cache is external and optional.
A four (64-bit) word write buffer is used to buffer writes to external memory arising from cache write-back, cache write-through, and uncached stores. This enables the processor to proceed in parallel while external memory is updated.
The R4600 has a five stage pipeline which has a one cycle latency for computational instructions. Computational instructions perform arithmetic, logical, and shifting operations using register operands or a register operand and a 16-bit signed immediate.
Load instructions don't allow the instruction immediately following, termed the load delay slot, to use the result of the load, thus giving a load latency of two cycles. Scheduling of instructions in the delay slot is desirable for increased throughput, though not strictly required, as the pipeline will slip one cycle in the case of a dependent instruction in the delay slot.
All jump and branch instructions have a latency of 2 cycles. The instruction in the delay slot following the jump is executed while the target of the jump is being fetched. The exception is a conditional branch-likely instruction that is not taken, in which case the delay slot instruction is nullified.
From 11 To 8 Instructions
For the R4600 implementation, four 64-bit registers are needed. We name them r1, r2, v and P. A first compilation of the algorithm leads to 11 instructions per translation step:
Note that all load delay slots in this (and the following) versions are filled with useful operations, i.e. do not cost additional cycles. By using appropriate coding (see footnote 1), the same holds for the branch delay slot.
Footnote 1: Use the bzl instruction, which nullifies the immediately following instruction if the branch is not taken:
As a further optimization, we use the fact that the R4600's minimal page size is 4K and that the range of s0 and s1 is always 0...63. Therefore 2 × 6 = 12 bits are sufficient for s0 and s1, and since the 12 lowermost bits of G are never used, we combine these three fields in one 64-bit word:
The second 64-bit word is used for pointing to the next level 
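One way to fold the three fields into a single 64-bit word is sketched below. The bit positions are an assumption made for illustration (s0 in bits 11...6 and s1 in bits 5...0); the paper's actual layout is not reproduced here, and `pack_entry` and `unpack_entry` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1
LOW12 = 0xFFF      # the 12 lowermost bits of G are never used
SHIFT_MASK = 0x3F  # 6 bits hold a shift count in the range 0...63


def pack_entry(G: int, s0: int, s1: int) -> int:
    """Combine G, s0 and s1 in one 64-bit word (assumed field order)."""
    if G & LOW12:
        raise ValueError("low 12 bits of G must be zero")
    if not (0 <= s0 <= 63 and 0 <= s1 <= 63):
        raise ValueError("shift counts must fit in 6 bits")
    return (G & MASK64) | (s0 << 6) | s1


def unpack_entry(word: int):
    """Split a packed word back into (G, s0, s1)."""
    G = word & ~LOW12 & MASK64
    s0 = (word >> 6) & SHIFT_MASK
    s1 = word & SHIFT_MASK
    return G, s0, s1


word = pack_entry(0b111011 << 58, 60, 58)
```

Packing costs nothing at lookup time only if the shifts and masks needed to unpack fit into otherwise free instruction slots, which is why the field order matters on a real processor.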
Timing
Since no instruction interlocks are effective in the algorithm, i.e. since all delay slots are filled with useful instructions, an n-step guarded page table walk costs 8n cycles as long as all accesses hit the cache. Since, within one address space, the R4600 supports 40-bit addresses and the smallest page is 4K, no more than (40-12)/4 = 7 translation steps should be necessary [5] per translation. Recall that the number of steps can vary from page to page; fewer than 7 steps are required in very sparse or in contiguous regions. It seems reasonable to expect 3 to 7 steps, depending on OS strategy and type of application. Assuming a 4-cycle penalty for a cache miss, this corresponds to costs of [24...36] (3 steps) up to [56...84] (7 steps) cycles per GPT walk.
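The cycle ranges quoted above follow from simple arithmetic, which the following Python sketch reproduces. It assumes 8 cycles per step (one per instruction of the 8-instruction step) and at most one cache miss of 4 cycles per step; the function name is hypothetical.

```python
def gpt_walk_cycles(steps: int, cycles_per_step: int = 8, miss_penalty: int = 4):
    """Return (all cache hits, one cache miss per step) cycle counts for a walk."""
    best = steps * cycles_per_step
    worst = steps * (cycles_per_step + miss_penalty)
    return best, worst


# Upper bound on steps: 40 address bits, 12 offset bits, at least 4 bits per step.
MAX_STEPS = (40 - 12) // 4

for n in (3, MAX_STEPS):
    print(n, gpt_walk_cycles(n))
```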
R4600 Memory Management
An introduction to R4600 memory management is needed before presenting the GPT implementation further. The R4000 architecture has a 64-bit virtual address space; however, the R4600 only implements a 1TB (40-bit) user mode virtual address space together with a 64 GB physical address space. It uses a joint translation lookaside buffer (JTLB) to translate instruction and data virtual memory references to physical memory references. The JTLB is a 48-entry fully associative memory. Each entry maps an even-odd pair of virtual pages to their corresponding physical addresses, giving a potential of 96 mapped virtual pages. The page size is configurable per entry from 4KB to 16MB, in steps of a factor of 4.
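To make these numbers concrete, the following Python sketch lists the page sizes implied by "4KB to 16MB in factors of 4" and the amount of memory the JTLB can map at once. The helper name `jtlb_reach` is hypothetical, and the calculation assumes that every entry uses the same page size.

```python
KB = 1024
MB = 1024 * KB

# 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB
PAGE_SIZES = [4 * KB * 4**i for i in range(7)]

JTLB_ENTRIES = 48
PAGES_PER_ENTRY = 2   # each entry maps an even-odd page pair


def jtlb_reach(page_size: int) -> int:
    """Bytes mapped when all JTLB entries use the given page size."""
    return JTLB_ENTRIES * PAGES_PER_ENTRY * page_size


print(jtlb_reach(4 * KB) // KB, "KB reachable with 4KB pages")
```

With 4KB pages the JTLB covers only 384KB, which is why the cost of the software refill path matters for applications with large or sparse working sets.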
An 8 bit address space identifier (ASID) is associated with each entry in the JTLB. The ASID is used together with the virtual address when checking for a match, thus allowing multiple address spaces in the JTLB simultaneously, which reduces the need for JTLB flushing during context switching.
The R4600 also contains a 2 entry instruction TLB (ITLB) and a 4 entry data TLB (DTLB), with each entry mapping a 4KB page. ITLB and DTLB misses are automatically refilled from the JTLB making operation of the ITLB and DTLB transparent to users.
JTLB misses are handled via a TLB Refill exception and a software routine that loads a new entry into the JTLB. Other TLB related exceptions are handled by the processor's general exception mechanism. This relieves the TLB refill routine of determining which exception is involved and allows it to be optimized solely for refill. Refill software can overwrite selected TLB entries or use a hardware-provided mechanism to overwrite a randomly selected entry.
TLB Refill in Detail
TLB refill has been measured as contributing up to 40% of total execution time [3] in some applications. While such high contributions are not normal, it is nonetheless important to minimize TLB refill costs as much as possible. Before presenting or analyzing any TLB refill routines, the basic cost of taking a null exception ($C_{excpt}$) needs to be determined. This is the cost of taking an exception that simply performs an exception return (eret) instruction. An exception-generating instruction causes execution to begin at the appropriate exception vector when it reaches the fifth stage of the pipeline [4], which costs 4 cycles. Assuming eret has a delay slot similar to a branch or jump, it costs 2 cycles. Thus $C_{excpt} = 6$ cycles.
Refill--Virtual Array
To serve as a reference, the best case TLB refill is presented. However, before presenting it, four coprocessor 0 (CP0) registers need introducing.
MIPS designers provide limited hardware support to speed up the software refill process via the Context or XContext registers. The Context register is a 32-bit version of the 64-bit XContext register, which is described below.
The XContext register, illustrated in figure 4, contains an operating-system-settable Page Table Entry Base (PTEBase) field which is used to store the base of a page table array. Upon a TLB miss, the BadVPN2 field is set to the virtual page-pair number that misses. For 4K pages, the register can simply be used as the address of a page table entry pair.

Refill--Skeleton
Before presenting more complicated refill routines, the following TLB refill skeleton is factored out, as it is common to all routines presented later. The skeleton loads the miss address from a CP0 register and frees an extra register. Once the page table entries have been fetched, it loads them into the EntryLo registers, writes the TLB, and restores the freed register. The timing of the skeleton ($C_{skel}$) is 9 cycles. If extra registers are needed for page table lookup, each costs 2 cycles ($C_{xreg}$).

Refill--GPT
Firstly, GPT translation is modified slightly. Instead of terminating with P pointing to the physical address, it finishes with P pointing to an even-odd pair of page table entries suitable for direct loading into EntryLo.

Using the skeleton above, with BadVAddr as the CP0_reg (which contains the address at which the TLB miss occurred), the timing of the GPT refill routine ($C_{gpt}$), where n is the number of levels traversed in the page table, is:

$C_{gpt} = C_{excpt} + C_{skel} + C_{xreg} + 5 + 8n$
For a 3-level lookup $C_{gpt3} = 46$ cycles; for a 7-level lookup $C_{gpt7} = 78$ cycles.
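The two figures can be checked against the individual cost components given in the text. The Python sketch below assumes that exactly one extra register is freed (2 cycles); this assumption is not stated explicitly, but it is the value that reproduces the quoted 46 and 78 cycles. The constant and function names are hypothetical.

```python
C_EXCPT = 4 + 2   # null exception: 4 cycles to enter, 2 cycles for eret
C_SKEL = 9        # refill skeleton
C_XREG = 2        # cost per extra register


def gpt_refill_cycles(n: int, extra_regs: int = 1) -> int:
    """GPT refill cost for an n-level walk, assuming all cache hits."""
    return C_EXCPT + C_SKEL + extra_regs * C_XREG + 5 + 8 * n


print(gpt_refill_cycles(3), gpt_refill_cycles(7))
```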
Cache Effects
So far it has been assumed that all data and instructions are in cache. Instruction cache misses will have similar effects on all refill routines, with the penalty being proportional to the length of the routine. However, data cache misses have the potential to show large differences between the two refill routines, as the amount of data accessed varies markedly. Given a data cache miss penalty of 6 cycles for a single doubleword, plus 2 cycles for each extra doubleword, up to 12 cycles for an entire cache line, data access can be expensive.
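The data cache penalty described here is a linear function of the number of doublewords fetched. A minimal Python sketch, assuming a 32-byte cache line of four 64-bit doublewords (the function name is hypothetical):

```python
DOUBLEWORDS_PER_LINE = 32 // 8   # 32-byte line, 8-byte doublewords


def dcache_miss_penalty(doublewords: int) -> int:
    """Cycles lost when a miss has to fetch the given number of doublewords."""
    if not 1 <= doublewords <= DOUBLEWORDS_PER_LINE:
        raise ValueError("a miss fetches between 1 and 4 doublewords")
    return 6 + 2 * (doublewords - 1)


print([dcache_miss_penalty(k) for k in range(1, DOUBLEWORDS_PER_LINE + 1)])
```

A two-doubleword GPT entry therefore costs 8 cycles when it misses, which matches the factor of 8 per level in the cache-miss formula for the GPT routine.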
The best case routine assuming cache misses costs $C_{best.m} = C_{best} + 8$. For the GPT routine, $C_{gpt.m} = C_{gpt} + 8(n+1) + 6$. Table 1 shows the cost of the refill routines presented so far, assuming all data cache hits and then assuming all data cache misses.

Routine        Cache hits    Cache misses
$C_{gpt7}$     78            148

Table 1: TLB refill routine cost (cycles).

Refill Comparison
Direct comparison between the refill routines does not take into account the frequency of TLB misses. In the extreme, it does not matter how long refill takes if the TLB never misses. To facilitate a more revealing comparison, we use as metric the percentage of cycles due to TLB refill (%tlb) compared to total cycles, which we aim to minimize. Assuming cycles due to TLB refill ($C_{tlb}$), and grouping all other cycles together, %tlb is the share of $C_{tlb}$ in the total cycle count. Figure 6 illustrates the TLB overhead associated with the six routines tabulated above, for various miss rates.
To avoid misunderstandings, we explicitly mention:
• The miss rate we use is neither the TLB miss rate per memory access nor per instruction. Instead, we use the number of TLB misses per cycle that is not related to TLB miss handling. Cum grano salis, these are instruction execution and cache miss cycles. For illustration: assume an application with a TLB miss rate per LD/ST instruction of 1% (which is high), on average one LD/ST per 3 instructions, and a 5% cache miss rate (8 cycles penalty). Then one TLB miss occurs per 340 cycles, i.e. our TLB miss rate is 0.003.
• We use the "best" mechanism for comparison only. Its TLB refill cost is a theoretical minimum. In practice, higher-level page table misses impose additional costs. Nagle et al. [10] report up to a twofold increase even for traditional (non-sparse) applications and operating systems.

Figure 6: TLB overhead for TLB refill routines
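The conversion from a per-instruction miss rate to the per-cycle miss rate used here can be written out as follows. The Python sketch assumes one cycle per instruction and that the 5% cache miss rate applies to LD/ST instructions; both assumptions are inferred from the 340-cycle result rather than stated in the text, and the function name is hypothetical.

```python
def tlb_misses_per_cycle(tlb_miss_per_ldst: float,
                         instr_per_ldst: float,
                         cache_miss_per_ldst: float,
                         cache_penalty: float) -> float:
    """TLB misses per cycle that is not spent on TLB miss handling."""
    ldst_per_tlb_miss = 1 / tlb_miss_per_ldst                # 100 loads/stores
    instr_per_tlb_miss = instr_per_ldst * ldst_per_tlb_miss  # 300 instructions
    cache_cycles = ldst_per_tlb_miss * cache_miss_per_ldst * cache_penalty
    return 1 / (instr_per_tlb_miss + cache_cycles)           # 1 / 340


rate = tlb_misses_per_cycle(0.01, 3, 0.05, 8)
print(round(1 / rate), round(rate, 3))
```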
It can be seen that with miss rates less than 0.0001, it is largely irrelevant which routine is chosen for TLB refill, as refill's contribution to overall runtime is negligible.
In the case of high miss rates, for example 0.01, TLB overheads are significantly different. The best case routine overhead is expected to vary between 13% and 19%, whereas GPT overhead varies between 32% and 59%. Or to look at it differently, given a tolerable overhead of 10%, the best case routine can tolerate miss rates 2-10 times higher than GPT refill.
Thus it appears that GPTs are unsuitable for TLB refill where TLB miss rates are expected to be high, especially if cache miss costs are also high.
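These percentages can be approximated with a short calculation. The Python sketch below assumes that the overhead is the refill cycles divided by the total cycles, with the miss rate counted per non-refill cycle; this formula is a reconstruction from the definition of the metric, not a quotation. The GPT inputs are the 3-level all-hits cost (46 cycles) and the 7-level all-misses cost obtained from $C_{gpt.m} = C_{gpt} + 8(n+1) + 6$. Function names are hypothetical.

```python
def tlb_overhead(miss_rate: float, refill_cycles: float) -> float:
    """Fraction of all cycles spent in TLB refill (assumed formula)."""
    tlb_cycles_per_other_cycle = miss_rate * refill_cycles
    return tlb_cycles_per_other_cycle / (1 + tlb_cycles_per_other_cycle)


def gpt_refill_all_misses(c_gpt: int, n: int) -> int:
    """GPT refill cost when every data access misses the cache."""
    return c_gpt + 8 * (n + 1) + 6


low = tlb_overhead(0.01, 46)
high = tlb_overhead(0.01, gpt_refill_all_misses(78, 7))
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the GPT overhead at a miss rate of 0.01 comes out at roughly 32% to 60%, in line with the range quoted above.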
The Second Level TLB
Ideally, a robust mechanism is needed that supports address space sparsity, fast lookup, hierarchical operations, and graceful performance degradation when faced with increasing TLB miss rates. A second level TLB (as described in [1]) in combination with GPTs should be the answer. The second level TLB (TLB2) is a software cache of page table entries used to refill the hardware TLB.
5.1
TLB2 Design Issues
Tagged or Per-Process
The first design decision to be made is whether TLB2 should be a per-process cache or a global, address space tagged, cache. A per-process cache slows the context switch time as the cache base address needs to be changed, though this may be insignificant when compared to other switching overheads.
A single tagged cache is more space efficient. A per-process cache takes n times the space for n processes for the same potential per-process cache capacity. A single tagged cache will adapt to the workload, caching only active TLB entries, whereas a per-process cache may itself be entirely inactive.
A single tagged cache is small enough to use unmapped physical memory. A per-process cache is more suited to implementation in virtual memory as the number of processes is unknown and potentially large. Virtual memory implementation requires handling of complex nested TLB misses which are avoided in the physical implementation.
Flushing all cache entries associated with a physical frame is simpler and faster with a single tagged cache, than with n per-process caches of similar size.
For these reasons, we choose to use a single tagged cache for TLB2.
Size
Required performance dictates the size of TLB2; however, the following factors make it desirable to keep TLB2 small. TLB2 uses unmapped physical memory, which is a limited resource, though it is expected that TLB2 will be small enough to effectively ignore this limitation.
TLB2 flushing grows more expensive as size increases. Flushing can be on a per physical page frame basis, or on a per address space tag basis. These events occur, for example, on page frame swapout and address space destruction respectively. These are expected to be infrequent operations when compared to TLB2 lookup, though they should be kept in mind when sizing TLB2.
The R4600 has 16-bit immediates. This gives a 16-bit mask operation or a load operation from a 64KB address space, in a single instruction. Larger masks or load offsets require multiple instructions. This needs to be kept in mind as TLB2 lookup is time critical. The performance gained by having a large cache may be offset by the extra time taken to access it.
Associativity
High associativity is desirable in a cache to decrease the likelihood of conflict misses. In a hardware cache implementation, n-way associativity requires n comparisons in parallel to determine a hit. In software, n-way associativity requires n comparisons in sequence. Sequential comparisons need to be minimized, as TLB2 lookup is time critical. The tradeoff between increased lookup time due to sequential comparisons and decreased miss rate due to associativity needs to be carefully balanced.
A Direct-Mapped TLB2
Before describing a direct mapped TLB2, another CP0 register needs introducing. The EntryHi register is used to set the hardware lookup tag in a TLB entry when adding a new TLB entry or probing for an existing one. It contains the virtual page number of a page pair (VPN2) and an associated address space identifier (ASID), as illustrated in figure 7.
EntryHi is set on a TLB miss to a value appropriate for adding a new entry into the TLB. It can also be set by the operating system when adding a TLB entry that is not associated with a TLB exception. The optimisation of this is to recognise that the upper 34 bits of the page table entries are always zero. This allows two 32-bit page table entries to be stored in a single 64-bit word, giving a block size of two 64-bit words, which is easily indexed in TLB2.
[Figure 7: EntryHi register (fields R, VPN2, 0, ASID)]
This optimisation costs nothing in terms of speed. The two 64-bit page table entries would be loaded using two "load double" instructions. The optimised 32-bit entries are loaded using two "load word" instructions, which sign-extend the values to 64 bits at no extra cost. By having two TLB2 blocks within a single 32 byte data cache line instead of one, the compact structure may indeed be faster, as it reduces the chance of a data cache miss on load.
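The block structure described above (one 64-bit tag word plus one 64-bit word holding two 32-bit entries) can be modelled in Python as follows. This is a sketch under assumptions rather than the R4600 routine: the class and method names are hypothetical, the index is simply taken from the low bits of VPN2, the full EntryHi value (VPN2 at bit 13 and above, ASID in the low 8 bits) serves as the tag, and sign extension on load is not modelled. The default of 4096 blocks of 16 bytes is chosen only because it fills exactly 64KB.

```python
TLB2_BLOCKS = 4096   # hypothetical size: 4096 blocks x 16 bytes = 64KB
VPN2_SHIFT = 13      # VPN2 starts at bit 13 of EntryHi
PTE_MASK = 0xFFFFFFFF


class Tlb2:
    """Direct mapped, address space tagged software cache of page table entries."""

    def __init__(self, blocks: int = TLB2_BLOCKS):
        self.blocks = blocks
        self.tags = [None] * blocks   # first word of a block: EntryHi tag
        self.data = [0] * blocks      # second word: two packed 32-bit entries

    def _index(self, entry_hi: int) -> int:
        return (entry_hi >> VPN2_SHIFT) % self.blocks

    def lookup(self, entry_hi: int):
        """Return (even_pte, odd_pte) on a hit, None on a TLB2 miss."""
        i = self._index(entry_hi)
        if self.tags[i] != entry_hi:
            return None
        return self.data[i] >> 32, self.data[i] & PTE_MASK

    def insert(self, entry_hi: int, even_pte: int, odd_pte: int) -> None:
        """Replace whatever block currently occupies the slot."""
        i = self._index(entry_hi)
        self.tags[i] = entry_hi
        self.data[i] = ((even_pte & PTE_MASK) << 32) | (odd_pte & PTE_MASK)


tlb2 = Tlb2(blocks=16)
hi = (5 << VPN2_SHIFT) | 3        # VPN2 = 5, ASID = 3
tlb2.insert(hi, 0x100, 0x140)
```

Because the ASID is part of the tag, the same page pair in a different address space misses instead of aliasing, and a page pair that maps to the same slot simply evicts the previous block.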
The refill routine to implement a direct mapped TLB2 is:

The timing for a hit is $C_{excpt} + C_{skel} + 8 = 23$ cycles. A miss is a little more complicated, as it includes a GPT lookup and replacing the missed TLB2 entry ($C_{repl}$). The cost is $C_{excpt} + C_{skel} + 7 + C_{gpt} + C_{repl}$. The TLB2 miss routine is:
             Cache hits        Cache misses
GPT level    hit     miss      hit     miss
3            23      60        31      104
7            23      92        31      168

Table 2: Direct mapped TLB2 costs

Now assume that TLB2 is sized such that it has, on average, a 10% miss rate. The average timing for the case of 3-level GPT translation assuming data cache hits is 0.9 * 23 + 0.1 * 60 = 26.7. The worst case average timing, assuming 7-level translation with cache misses, is 0.9 * 31 + 0.1 * 168 = 44.7.
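The two averages are expected values over TLB2 hits and misses. A small Python sketch (the function name is hypothetical) makes it easy to try other TLB2 miss rates against the figures of Table 2:

```python
def average_refill_cycles(hit_cycles: float, miss_cycles: float,
                          tlb2_miss_rate: float = 0.1) -> float:
    """Expected refill cost for a given TLB2 miss rate."""
    return (1 - tlb2_miss_rate) * hit_cycles + tlb2_miss_rate * miss_cycles


best = average_refill_cycles(23, 60)     # 3 levels, data cache hits
worst = average_refill_cycles(31, 168)   # 7 levels, data cache misses
print(round(best, 1), round(worst, 1))
```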
With the assumption of a 10% TLB2 miss rate, figure 8 shows the TLB overhead for: best case refill, 
Concluding Remarks
The presented software is available through the World Wide Web under http://www.vast.unsw.edu.au/Mungi/Mungi.html.
A more detailed version of this paper (including a discussion of page access right and n-way TLB2s) is available as UNSW Technical Report [9] .
