Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power-of-two multiple of the base page size and they must be aligned in both virtual and physical address spaces. Very large superpages (e.g., 1MB) are clearly useful for mapping special structures, such as kernel data or frame buffers. This paper considers the architectural and operating system support required to exploit medium-sized superpages (e.g., 64KB, i.e., sixteen times a 4KB base page size). First, we show that superpages improve TLB performance only after invasive operating system modifications that introduce considerable overhead.
We then propose two subblock TLB designs as alternate ways to improve TLB performance. Analogous to a subblock cache, a complete-subblock TLB associates a tag with a superpage-sized region but has valid bits, physical page number, attributes, etc., for each possible base-page mapping. A partial-subblock TLB entry is much smaller than a complete-subblock TLB entry, because it shares physical page ...

... Sun Microsystems External Research Grant. The experiments were performed on equipment donated by Sun Microsystems. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. ASPLOS VI-10/94 San Jose, California USA © 1994 ACM 0-89791-660-3/94/0010..$3.50

... translation lookaside buffer¹ (TLB). A TLB is a cache whose tags are virtual page numbers (VPN) and data are physical page numbers (PPN), page attributes (e.g., protection, cacheability), and optional reference and modified bits [Mile90, Henn90, Smit82]. TLBs must be studied again, because of current workload and processor trends.
Future workloads will demand greater TLB reach (the maximum size of memory mapped by a TLB) than today's workloads. Typical physical memory sizes continue to follow their historical exponential growth curve, with 100MB+ memories likely to be common when the microprocessors being designed today are deployed in systems. It seems unlikely that this demand for physical memory is occurring without a commensurate increase in memory use (e.g., larger working sets). Furthermore, the growing importance of nontraditional computation, such as multimedia, is likely to increase memory usage and change locality patterns. TLBs must be designed for larger TLB reach to support future applications.
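As a rough illustration of TLB reach, the following Python sketch multiplies the number of entries by the memory each entry can map. The function name and its parameters are ours, chosen for illustration; the 64-entry TLB, 4KB base page, 64KB superpage, and subblock factor of 16 are the configurations studied later in this paper.

```python
KB = 1024

def tlb_reach(entries: int, page_size: int, subblock_factor: int = 1) -> int:
    """Maximum number of bytes mapped: entries x pages per entry x page size."""
    return entries * subblock_factor * page_size

base_reach = tlb_reach(64, 4 * KB)            # every entry maps one 4KB base page
superpage_reach = tlb_reach(64, 64 * KB)      # every entry maps one 64KB superpage
subblock_reach = tlb_reach(64, 4 * KB, 16)    # every entry maps up to 16 base pages

print(base_reach // KB, superpage_reach // KB, subblock_reach // KB)
```

The superpage and subblock figures are upper bounds: they are reached only when every entry actually holds a superpage or a fully populated subblock entry.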
Furthermore, processor trends require that the increased TLB reach be provided with a fast TLB access time. Dramatic reductions in processor cycles-per-instruction (CPI), from ten to less than one, have increased the relative importance of TLBs. In addition, the continued use of physically-tagged level-one caches places TLB access times on the cache-access critical path. Furthermore, the trend toward supporting multiple cache accesses per cycle (e.g., Intel Pentium and SGI R8000 (TFP)) also means that the TLB must support multiple translations per cycle through multi-porting or replication.
Multi-porting increases TLB complexity and access time, while replication increases cost. Both suggest that the brute force solution of increasing TLB reach by making larger TLBs may be unattractive.
Finally, current and future TLBs are on microprocessor chips, so TLB design is a part of chip design instead of system design, as in the past. Prudent designers will seek TLBs that serve many workloads to avoid condemning their chips to limited markets.
For these and other reasons, several recent microprocessor architectures support one or more sizes of superpages.
Superpages use the same linear address space ...

¹ Also known as Translation Buffer (TB), Directory LookAside ...

The clear motivation for supporting superpages is that using them appears to increase TLB reach for free. This is certainly true for very large superpages (e.g., 1MB), as they are very effective in mapping large objects such as kernel data, frame buffers, and large arrays. Some architectures specify two TLBs, one for base pages and another for superpages (e.g., PowerPC), and allow for restricted use of superpages with special operating system support. We assume that TLBs will include special support for large superpages and operating systems will use them.
In this study, we concentrate only on the benefits and costs of supporting medium-sized superpages (e.g., 64KB). Thus, in the rest of this paper, when we say superpages we mean medium-sized superpages.
The impact of supporting (medium-sized) superpages in TLBs is twofold. First, it appears that the TLB must be fully-associative, because selecting a set with the least significant bits of the virtual page number is difficult when the page size is not known [Tall92]. The SGI R8000 (TFP), for example, implements a set-associative TLB, but restricts a process to a single page size [MIPS93]. Second, the complexity or time needed to handle TLB misses is likely to be larger for superpages. We expect this to be offset easily by the reduction in the number of TLB misses. The same algorithm can be used to efficiently support superpages.

Most facets of paged virtual memory operating system policies and mechanisms require modifications to support superpages effectively. We first describe a new policy (page-size assignment) and two new mechanisms (page promotion and page demotion). We then briefly discuss the impact of supporting superpages on existing operating system policies and mechanisms.
A page-size assignment policy decides the page size to use for each virtual address. The policy may change over time, differ between objects and differ between processes. A policy must balance the costs and benefits of using superpages. A static page-size assignment policy will make the decision once and fix the page size over the life of the mapping (e.g., for frame buffers). Often the operating system does not know, in advance, enough about the costs and characteristics of accesses to the object to make an informed static decision.
The operating system will then have to use a dynamic page-size assignment policy, guessing a page size to use and modifying it if the guess proves incorrect. Implementation of the policy will span the virtual memory manager and the file systems.
Two additional operating system mechanisms support a dynamic page-size assignment policy. Page promotion is the mechanism by which a set of pages is coalesced into a larger superpage. Page demotion is the reverse process. The operating system uses these mechanisms when it decides to switch page sizes for a virtual address range. Page demotion involves unloading the superpage mapping, and possibly replacing it with base page mappings.
Page promotion may involve verifying that the base pages are compatible for promotion, unloading any existing base page mappings from the page tables and TLBs, allocating contiguous physical memory, copying the base pages to contiguous memory (a gather operation), doing additional I/O, and updating page tables and TLBs. A gather operation is very expensive and may more than offset any TLB performance improvement due to use of superpages.
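To make the cost distinction concrete, the sketch below classifies a superpage-sized region as ready to remap, needing a gather, or not yet promotable. It is a minimal illustration under stated assumptions: the `page_table` dictionary (VPN to PPN) and the function name are hypothetical, and the attribute compatibility check is omitted.

```python
from typing import Dict, Optional

def promotion_plan(page_table: Dict[int, int], region_vpn: int, factor: int) -> Optional[str]:
    """Classify the region of `factor` base pages that starts at `region_vpn`.

    Returns None if some base page has no mapping yet (more I/O would be needed),
    "remap" if the pages are already contiguous and aligned in physical memory,
    and "gather" if they would have to be copied to contiguous memory first.
    """
    if region_vpn % factor != 0:
        raise ValueError("region must be superpage-aligned in virtual space")
    ppns = [page_table.get(region_vpn + i) for i in range(factor)]
    if any(p is None for p in ppns):
        return None
    first = ppns[0]
    aligned = first % factor == 0
    contiguous = all(p == first + i for i, p in enumerate(ppns))
    return "remap" if aligned and contiguous else "gather"

# Four base pages per superpage; physical pages 32..35 are aligned and contiguous.
print(promotion_plan({16 + i: 32 + i for i in range(4)}, 16, 4))
```

Only the "gather" outcome incurs the copying cost discussed above; the "remap" outcome needs page table and TLB updates only.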
The impact of adding superpage support to operating systems is twofold.
First, it adds significant overhead (time spent in the operating system) and makes superpages less attractive.
These overheads are fundamental to the use of superpages and independent of the operating system. For example, using superpages increases the amount of I/O, page initialization overhead, and page fault latency. If the operating system is efficient and reduces the cost of these overheads, superpages can be used more often to improve TLB performance. For example, intelligent physical memory allocation can remove the need for a gather operation during page promotion.
Also, TLB misses incur a higher average miss penalty since the page tables are expected to be more complicated when using superpages.
Second, the changes required for efficient superpage support are invasive and affect large portions of existing operating systems. Physical memory management, for example, must be overhauled to handle variable sizes and external fragmentation [Knut68, Pete77]. Many key data structures (e.g., page tables) and interfaces need to be redesigned. Use of superpages often conflicts with file system read-ahead and requires coordination on what would otherwise have been local policy decisions. Many of these changes also adversely affect the performance of programs that do not use superpages. Table 2 lists a sample of the overheads that operating systems incur when using superpages.
For superpages to be useful, the TLB benefit of using superpages must be greater than the costs due to these overheads. Table 3 lists some modifications to important operating system policies and mechanisms to support superpages efficiently. Table 4 lists some situations where superpages are inadequate.
A detailed discussion is beyond the scope of this paper. In the next section, we describe subblock TLBs, which reduce the burden on the operating system and also deliver better TLB performance.
During periods of high memory demand, external fragmentation prevents use of superpages. Many of the operating system modifications for superpages continue to add overhead, even though there is no further TLB benefit.
Linear arrays and hash tables do not scale efficiently to include superpages, e.g., page tables and most hash tables. Algorithms traversing more complicated data structures take longer, e.g., TLB miss penalty increases.
Superpages increase internal fragmentation and memory demand. Significant time is spent in I/O and initializing memory that the program never references.
Operating system has to guarantee that the constraints for superpages are satisfied. Adds a check, sometimes of information not easily accessible.
If the base pages involved in the promotion are not contiguous in physical memory, the contents must be copied to a superpage. Adds significant overhead.

OS Mechanism/Policy: Modifications
Page-size assignment: New policy. It is difficult to balance the costs and benefits of page promotion, as both are often not easily estimated or known in advance.
Page Replacement: Replacement policies, such as CLOCK, give equal weight to all pages. Superpages have a higher cost. Requires re-evaluation of page replacement policies.
File System read-ahead/Page Clustering & Page-size assignment: File systems and device drivers read-ahead and cluster I/O into efficient large requests. Superpages already include this benefit. File systems and the virtual memory manager must coordinate to avoid making locally-optimal decisions.
Physical Memory Management & Page Coloring: Superpages already include some of the benefits of page coloring: a superpage consists of one base page from each physical equivalence class. But large physically-tagged caches will require page coloring with superpages too.
Aliases and Synonyms: Aliases could use different page sizes, and the page sizes for a virtual address and corresponding physical address may differ. Complicates data structures. Applications using fine-grain protection (e.g., copy-on-write) have to forgo the benefits of superpages if even a single base page has different attributes.
Virtual Address Allocation: Objects mapped into an address space may not start or end at superpage-aligned addresses. Restricts use of superpages.
Interfaces: Many interfaces assume a single page size, e.g., external pager interfaces. Superpages are hard to use efficiently with existing interfaces.

Subblock TLBs

Subblocking², borrowed from cache design, makes TLBs more effective than superpages while requiring simpler operating system support. Subblock TLBs have the TLB reach advantages of superpages and, in addition, can exploit these advantages more often than superpages (e.g., for objects smaller than superpage size). However, subblocking requires larger TLB entries and additional control logic. Figure 1 illustrates the structure of a single entry for the four different types of TLBs we will consider. Entries may be combined to build fully-associative or set-associative TLBs. The first entry illustrates a non-subblocked TLB entry that maps a single base page and consists of Tag and Data fields. The Tag consists of a virtual page number (VPN), and Data contains a physical page number (PPN), attributes (Attr., e.g., protection, cacheability), modified (Mod) and valid (V) bits.
The next entry in Figure 1 illustrates a superpage TLB entry. The Tag includes a Size field that masks bits during tag compare and physical address generation.
Next, Figure 1 illustrates a complete-subblock TLB entry with subblock factor 4. A complete-subblock TLB entry with a subblock factor n has an n times larger data portion but a tag that is log2(n) bits smaller than that of a non-subblocked TLB entry. The MIPS R4x00 has, for example, a complete-subblock TLB with a subblock factor of 2 [Kane92]. On a TLB miss, before attempting a replacement, the tags and valid bits are checked to see if an empty subblock can hold the mapping.
Alternatively, all subblocks can be loaded on a TLB miss. The IBM RS/6000 and ARM 6x0, for example, support subblock attributes and require all subblocks to be valid.
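The lookup and fill behavior of a complete-subblock TLB can be sketched as follows. This is an illustration only: the class and function names are ours, a `None` slot stands in for a cleared valid bit, attributes and modified bits are left out, and the replacement policy is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUBBLOCK_FACTOR = 4                               # as in the Figure 1 example
INDEX_BITS = SUBBLOCK_FACTOR.bit_length() - 1     # log2(4) = 2

@dataclass
class CompleteSubblockEntry:
    tag: int                                      # VPN without its low log2(n) bits
    ppn: List[Optional[int]] = field(
        default_factory=lambda: [None] * SUBBLOCK_FACTOR)  # None = invalid subblock

def lookup(entries: List[CompleteSubblockEntry], vpn: int) -> Optional[int]:
    """Return the PPN for vpn, or None on a TLB miss."""
    tag, sub = vpn >> INDEX_BITS, vpn & (SUBBLOCK_FACTOR - 1)
    for e in entries:
        if e.tag == tag and e.ppn[sub] is not None:
            return e.ppn[sub]
    return None

def fill(entries: List[CompleteSubblockEntry], vpn: int, ppn: int) -> None:
    """On a miss, reuse an entry whose tag matches before allocating a new one."""
    tag, sub = vpn >> INDEX_BITS, vpn & (SUBBLOCK_FACTOR - 1)
    for e in entries:
        if e.tag == tag:
            e.ppn[sub] = ppn
            return
    entry = CompleteSubblockEntry(tag)
    entry.ppn[sub] = ppn
    entries.append(entry)                         # replacement policy omitted

tlb: List[CompleteSubblockEntry] = []
fill(tlb, 0b110100, 9)
fill(tlb, 0b110111, 47)      # same tag, so it lands in the same entry
print(len(tlb), lookup(tlb, 0b110111), lookup(tlb, 0b110101))
```

Because each subblock keeps its own PPN, the two physical pages in this example do not need to be contiguous or aligned.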
There are at least six advantages to using complete-subblocking over superpages, even though subblock and superpage TLBs have the same TLB reach (with subblock factor n and superpage size n times the base page size). First, complete-subblock TLBs allow applications to get all the benefits of using superpages with no operating system modifications beyond the TLB management code. Second, complete-subblock entries can map multiple base pages in situations where superpages cannot be used, such as unaligned segments, small objects, and nonuniform attributes.

The final entry of Figure 1 shows a partial-subblock TLB entry (with subblock factor 4) where we coalesce the four pairs of PPN and Attr fields into a single pair. The good news is that the entry is now not much larger in area than a superpage TLB entry, yet by maintaining individual valid and modified bits we retain many advantages of complete subblocking.
The bad news is that base pages can share a partial-subblock TLB entry only if they have identical attributes and are properly placed in a superpage region.

... to four consecutive virtual pages that are backed by noncontiguous physical pages. A single-page-size TLB will require all four TLB entries. A superpage TLB also will require four TLB entries³, because the physical pages are not all contiguous.
A complete-subblock TLB will use a single TLB entry. A partial-subblock TLB will use three TLB entries, as two pages (VPNs 110110 and 110111) are aligned and share an entry.
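The entry counts above can be reproduced with a small counting sketch. The VPNs follow the text, but the PPN values are hypothetical, chosen only so that the last two pages are properly placed; the function name is ours, all pages are assumed to have identical attributes, and the smaller superpage sizes mentioned in the footnote are ignored.

```python
from typing import Dict, List, Tuple

def entries_needed(mappings: Dict[int, int], factor: int) -> Dict[str, int]:
    """Count the TLB entries needed to hold `mappings` (VPN -> PPN)."""
    blocks: Dict[int, List[Tuple[int, int]]] = {}
    for vpn, ppn in mappings.items():
        blocks.setdefault(vpn // factor, []).append((vpn, ppn))

    superpage = partial = 0
    for pages in blocks.values():
        # "Properly placed": same offset within the region, virtually and physically.
        placed = [(v, p) for v, p in pages if v % factor == p % factor]
        regions = {p // factor for _, p in placed}
        # Placed pages in one physical region share an entry; the rest need one each.
        partial += len(regions) + (len(pages) - len(placed))
        # A superpage needs the whole region resident in one aligned physical region.
        if len(placed) == factor and len(regions) == 1:
            superpage += 1
        else:
            superpage += len(pages)
    return {
        "single-page-size": len(mappings),
        "superpage": superpage,
        "complete-subblock": len(blocks),
        "partial-subblock": partial,
    }

example = {0b110100: 0b001001, 0b110101: 0b010000,   # not properly placed
           0b110110: 0b101110, 0b110111: 0b101111}   # aligned, same physical region
print(entries_needed(example, 4))
```

With these assumed PPNs the counts come out as four, four, one, and three entries, matching the discussion above.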
One implementation difficulty for partial-subblock TLBs is that the same tag value could be loaded into more than one TLB entry, as happened in Figure 2 when the PPNs were not aligned. Such a multiple match condition could cause electrical problems for some implementations.
A straightforward solution is to combine the valid bits with the tag so that, at most, one TLB entry can match on a lookup. This change may slightly increase TLB area and access time.
Finally, some operating system support is necessary to make partial-subblock TLBs effective. If the operating system succeeds in allocating aligned physical pages for two out of four base pages, a partial-subblock TLB can use just three TLB entries to hold the four mappings (Figure 2). The operating system could use a best-effort allocation algorithm and does not have to guarantee ...

³ An 8KB superpage could be used for pages 3 and 4, if supported. Further, if the page promotion costs were justified, a superpage of 16KB could be used after copying the pages into a superpage.

Figure 3 shows how partial-subblock TLBs can increase their effective TLB reach when mapping small objects, objects with unaligned starting addresses, and copy-on-write cases, all situations where superpages cannot be used.
A superpage mapping requires only one TLB miss to be loaded into the TLB, but a subblock TLB will require multiple TLB misses for all the subblocks to be loaded. By paying a higher miss penalty, the TLB miss handler could preload the mappings corresponding to all the subblocks that will be cached in the same TLB entry or all mappings in a superpage region even if they must be cached in different TLB entries. In either case, TLB hardware support can help reduce a much higher TLB miss penalty (e.g., to check whether page attributes and PPNs are compatible).
Superpages and partial-subblocking can be integrated into a single TLB. Partial-subblock support is very effective for medium-sized objects for which the operating system finds page-size assignment hard, for example, program text, shared libraries, data, and many heap segments. Superpages are easily the choice for very large objects for which the user or operating system can do static page-size assignment, for example, kernel data, frame buffers, and large heaps.
This section explained subblock TLBs and their advantages. Partial-subblock TLBs are most effective with operating system support for aligned physical memory allocation. In the next section we describe an efficient algorithm for such physical memory allocation.
Physical Memory Allocation: Page Reservation
The effectiveness of superpage and partial-subblock TLBs depends on the ability of the operating system to allocate aligned physical memory. In this section we describe a physical memory allocation algorithm, page reservation, which attempts to allocate memory in a way that helps TLB performance.
While superpages require the operating system to guarantee contiguity and require page promotions that may include gather operations, partial-subblocking requires only a best-effort allocation.
Physical memory is usually divided into equal-sized pages that are marked as either free or busy. A busy page has the contents of one page of an object (e.g., disk file, heap). The operating system maintains index structures to map physical pages to their identity (<object id, offset>) and vice versa. When a new page is required, the physical memory allocator searches the index structure to avoid duplicate allocations. Then it chooses a free page and updates the index structures. More than one process may map the same physical page using different virtual addresses. Hence, the physical memory manager uses the unique object page identity instead of virtual addresses in the index structures. Page reservation requires a new state for pages: reserved. A reserved page has an identity and is inserted into the index structures. However, its contents are not valid, similar to an "in-transit" state used during I/O. The operating system maintains the reserved pages in a reserved list, analogous to the free list.
Page reservation works as follows. On an initial base page fault (or during read-ahead by the file system), the physical memory manager allocates a superpage-sized region of base pages. With superpage size 64KB, for example, a page fault to address 0x41034 allocates sixteen base pages: the object pages corresponding to virtual addresses 0x40000, 0x41000, 0x42000, ..., 0x4f000. The accessed base page (0x41000) is loaded as normal and marked busy. Other base pages are marked reserved and added to the end of the reserved list. If the physical memory manager runs out of free pages, it frees pages by moving them from the head of the reserved list to the end of the free list. Subsequent page faults may find previously reserved base physical pages reserved or not. If reserved, the reserved physical page will be allocated and marked busy; if not, a physical page from the free list will be allocated.
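The steps above can be sketched as a toy allocator. This is a minimal illustration, not the implementation evaluated in this paper: the class and method names are hypothetical, page identities are simplified to (object, page number) pairs, page replacement of busy pages is omitted, and falling back to a single free frame when no aligned free region exists is our assumption.

```python
from collections import OrderedDict, deque

class PageReservation:
    """Toy physical memory manager with free, reserved, and busy frames."""

    def __init__(self, num_frames: int, factor: int):
        self.num_frames, self.factor = num_frames, factor
        self.free = deque(range(num_frames))   # free list, head on the left
        self.reserved = OrderedDict()          # identity -> frame, oldest first
        self.busy = {}                         # identity -> frame

    def _aligned_free_region(self):
        """Base frame of a fully free, superpage-aligned region, or None."""
        free = set(self.free)
        for base in range(0, self.num_frames - self.factor + 1, self.factor):
            if all(base + i in free for i in range(self.factor)):
                return base
        return None

    def _reclaim(self):
        """Move the frame at the head of the reserved list to the end of the free list."""
        _, frame = self.reserved.popitem(last=False)
        self.free.append(frame)

    def fault(self, obj, page: int) -> int:
        ident = (obj, page)
        if ident in self.busy:
            return self.busy[ident]
        if ident in self.reserved:             # the reservation is still intact
            self.busy[ident] = self.reserved.pop(ident)
            return self.busy[ident]
        base = self._aligned_free_region()
        if base is None:                       # assumed fallback: plain allocation
            if not self.free:
                if not self.reserved:
                    raise MemoryError("page replacement is not modeled")
                self._reclaim()
            self.busy[ident] = self.free.popleft()
            return self.busy[ident]
        first = page - page % self.factor      # first page of the superpage region
        for i in range(self.factor):
            other = (obj, first + i)
            if other == ident:
                self.free.remove(base + i)
                self.busy[ident] = base + i
            elif other not in self.busy and other not in self.reserved:
                self.free.remove(base + i)
                self.reserved[other] = base + i
        return self.busy[ident]

mem = PageReservation(num_frames=32, factor=16)
page = 0x41034 >> 12                           # base page number 0x41
print(mem.fault("obj", page), len(mem.reserved))
```

In the usage lines, the fault at 0x41034 marks one frame busy and reserves the other fifteen frames of the region starting at page 0x40, mirroring the 64KB example in the text.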
Page reservation provides a natural feedback mechanism for improving the effectiveness of partial-subblock and superpage TLBs without unduly increasing memory demand. In periods of low memory demand, pages will be allocated from reserved physical pages, allowing multiple base pages to share a partial-subblock TLB entry. Superpage TLBs benefit, because page promotion can be done without the cost of gathering base pages together. In periods of high memory demand, on the other hand, base pages will be rapidly removed from the reserved list and reallocated, gracefully degrading the page allocation policy back to the standard "fully-associative" non-superpage approach. Thus, there should be no significant change in the page fault rate from the non-superpage implementation.

... Pete77]. Some file systems also use similar techniques to reserve disk space [McKu84].
Page reservation significantly improves the performance of partial-subblock TLBs and reduces page promotion cost if using superpages. However, we did not study the effect of page reservation on cache behavior.
TLB Simulation Methodology
In this section, we describe the operating system support, simulation technique, metrics, and workloads used to compare the performance of single-page-size, superpage, and subblock TLBs.

Foxtrot supports partial-subblock TLBs using page reservation and file system prefetching. For page reservation, Foxtrot uses a superpage-sized region that corresponds to the TLB type, e.g., a TLB with subblock factor 16 will use a superpage of 64KB. When objects are smaller than a superpage-sized region, Foxtrot only reserves base pages up to the object size. Sometimes, the object must also be reserved (e.g., heaps require swap space allocation).
For disk files, Foxtrot also initiates asynchronous I/O for the region. File system clustering makes the I/O more efficient. Foxtrot does not prefetch for NFS and heap objects as it is more expensive.
When supporting superpages, Foxtrot uses a dynamic page-size assignment policy which does page reservation and prefetching as in the partial-subblock TLB case, and, in addition, makes policy decisions on when to promote base pages to superpages as follows. For every superpage region, the virtual memory manager maintains a count of the base pages within the region that have mappings in the page table. Page promotion occurs when the count exceeds the page promotion threshold. The page promotion threshold depends on the cost of populating the pages: heap (100%), disk files (50%), NFS files (75%). While this may not be the optimal policy or the most efficient implementation, it is "a" policy. Superpage and partial-subblock TLB simulations without operating system support are unrealistic.
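The threshold rule can be written down in a few lines. The sketch below is illustrative only: the dictionary and function names are ours, and we read "exceeds" as "reaches", since a strictly-greater test could never fire at the 100% heap threshold. That reading is an assumption, not something the text states.

```python
# Fraction of base pages in a superpage region that must be mapped before promotion.
PROMOTION_THRESHOLD = {"heap": 1.00, "disk": 0.50, "nfs": 0.75}

def should_promote(mapped_count: int, factor: int, kind: str) -> bool:
    """True once the mapped base pages in a region reach the threshold for `kind`."""
    # Assumption: ">=" rather than ">", so that heap regions can be promoted at all.
    return mapped_count >= PROMOTION_THRESHOLD[kind] * factor

# With a 64KB superpage and 4KB base pages there are 16 base pages per region.
for kind in ("heap", "disk", "nfs"):
    needed = min(n for n in range(17) if should_promote(n, 16, kind))
    print(kind, needed)
```

A higher threshold delays promotion until more of the region is known to be in use, which matters most where populating the remaining pages is expensive.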
While Foxtrot can support many page-size assignment policies, this paper focuses on TLB performance by fixing the operating system mechanisms and policies.
Trap-based simulation
We use trap-based simulation to compare the performance of superpage and subblock TLBs. Trap-based simulation for TLBs manipulates the valid bits in the page table to invoke a TLB simulator on page faults. The simulator maintains a data structure corresponding to the TLB under study, the target TLB, and marks valid only those page table entries that reflect the contents of the target TLB. This technique invokes the simulator only on target TLB misses and never on hits [Uhli94]. The kernel is modified to account for operating system effects (e.g., TLB invalidations) and superpage support.
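The core idea can be sketched in a few lines of Python. This is a conceptual model only, not Foxtrot's code: the class name is hypothetical, a set of valid VPNs stands in for the page table valid bits, and FIFO replacement stands in for whatever policy the target TLB uses.

```python
from collections import OrderedDict

class TrapSimulator:
    """Conceptual model: a page table entry is valid only if it is in the target TLB."""

    def __init__(self, tlb_entries: int):
        self.capacity = tlb_entries
        self.target_tlb = OrderedDict()   # VPN -> True, oldest first (FIFO stand-in)
        self.valid = set()                # VPNs whose page table entries are marked valid
        self.misses = 0

    def access(self, vpn: int) -> None:
        if vpn in self.valid:
            return                        # hit: no trap, the simulator is never invoked
        self._trap(vpn)                   # invalid entry: the "page fault" enters the simulator

    def _trap(self, vpn: int) -> None:
        self.misses += 1
        if len(self.target_tlb) >= self.capacity:
            victim, _ = self.target_tlb.popitem(last=False)
            self.valid.discard(victim)    # the victim's next reference will trap again
        self.target_tlb[vpn] = True
        self.valid.add(vpn)

sim = TrapSimulator(tlb_entries=2)
for vpn in (1, 2, 1, 3, 1):
    sim.access(vpn)
print(sim.misses)
```

Note that the model counts misses but has no way to count hits, since hits never reach `_trap`.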
Trap-based simulation has significant advantages over trace-driven simulation. First, TLB simulation requires information that is hard to encapsulate in a trace, such as page-size assignment, physical page numbers, and attributes.
The simulator has access to such information in the kernel. Second, trap-based simulation incurs overhead only on very infrequent TLB misses, allowing hits to proceed at hardware speed. Our simulator runs three to four orders of magnitude faster than a trace-driven simulation. Third, trap-based simulation naturally extends to multi-program workloads.
The key disadvantage of trap-based simulation is the inability to calculate the number of TLB hits without hardware support such as profiling counters [Site93] or external probes [Nagl92]. This makes it difficult to use normalized metrics (e.g., TLB miss ratio).
Foxtrot implements trap-based simulation for SPARC V8 processors [SPAR91]. The cost of a target TLB miss (including trap cost, TLB simulator complexity, and wrappers, much of which is written in C) is 1500 to 4000 cycles, comparable to the overhead seen by others [Uhli94, Rein93]. Our implementation, however, does not account for kernel references.
Metrics
While the ultimate measure of TLB performance is the fraction of execution time spent in servicing TLB misses, the TLB miss ratio is often used instead. As explained above, our simulator lacks the capability to count the number of TLB hits and we use the unnormalized number of TLB misses as our metric for comparing different TLBs. We also normalize the number of TLB misses by dividing by the number of TLB misses in an equivalent single-page-size TLB. In Table 5, we also include the cache miss counts, obtained from profiling counters on the machine.
5.4 TLB replacement algorithm
We use a pseudo-LRU TLB replacement algorithm for fully-associative TLBs. The algorithm is similar to the "Go Down Stack (GODS)" algorithm described by Deville et al. [Devi92]. We associate a used bit with every TLB entry that is set on hits to that entry. On a miss: (a) if there are any unfilled (invalid) TLB entries, we choose the first one for replacement; (b) if there are no unfilled TLB entries, we choose the first one with the used bit clear; and (c) if there are no unused TLB entries, we clear all the used bits and retry the algorithm.
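Steps (a) through (c) can be sketched as follows. The class name is ours, and one detail is an assumption: the used bit of a newly filled entry is set, on the reasoning that the retried access hits the new entry. The text only says the bit is set on hits.

```python
class PseudoLruTlb:
    """Fully-associative TLB with one used bit per entry."""

    def __init__(self, size: int):
        self.vpn = [None] * size          # None marks an unfilled (invalid) entry
        self.used = [False] * size

    def access(self, vpn: int) -> bool:
        """Return True on a hit, False on a miss (after filling an entry)."""
        if vpn in self.vpn:
            self.used[self.vpn.index(vpn)] = True
            return True
        slot = self._victim()
        self.vpn[slot] = vpn
        self.used[slot] = True            # assumption: the retried access hits this entry
        return False

    def _victim(self) -> int:
        for i, v in enumerate(self.vpn):  # (a) first unfilled entry
            if v is None:
                return i
        for i, u in enumerate(self.used): # (b) first entry with its used bit clear
            if not u:
                return i
        self.used = [False] * len(self.used)   # (c) clear all used bits and retry;
        return 0                          # the retry then selects the first entry

tlb = PseudoLruTlb(2)
print([tlb.access(v) for v in (1, 2, 3, 2, 4, 5)], tlb.vpn)
```

Clearing every used bit in step (c) gives the approximation of LRU its periodic reset: entries that are not touched again before the next miss become candidates for replacement.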
Workloads
Many programs have negligible TLB miss ratios and do not justify the overhead of page promotion required to use superpages.
We concentrate on benchmarks where TLB miss handling is a significant part of the execution time, because we expect it to be true for future workloads. Table 5 displays benchmark data, with the benchmarks sorted from most to least percent of user time spent on TLB miss handling.
Columns two and three give total and user execution time, showing that these benchmarks spend most of their time in user mode. Columns four and five give the number of user TLB misses (for a 64-entry fully-associative single-page-size TLB) and the percent of user time spent servicing these misses (assuming a 40 cycle TLB miss penalty). The data show that user TLB miss handling time is significant. Column six also supports this conclusion, showing that some benchmarks have more user TLB misses than user plus system cache misses (with a 1MB direct-mapped cache with 32-byte blocks). TLB misses may be even more important than cache misses, because, in many systems, the TLB miss penalty is larger than the cache miss penalty. Finally, column seven displays peak memory usage, showing that none of these benchmarks paged on our 96MB machine.
6 TLB Performance Study
In this section, we present simulation studies of superpage and subblock TLBs using operating system support from Foxtrot. Both types of TLBs use Foxtrot's page reservation (Section 4), while superpage TLBs also require page promotion (Section 5.1). Table 6 shows the number of user TLB misses for the benchmarks using 64-entry fully-associative unified TLBs with a single page size of 4KB, two page sizes of 4KB and 64KB, and partial- and complete-subblocking with a subblock factor of 16. The TLB replacement algorithm is described in Section 5.4. We also include the results for a single-page-size 256-entry 4-way set-associative TLB using random replacement.
In parentheses we normalize the TLB misses with respect to the single-page-size TLB.
Comparing TLB Misses
The second column of Table 6 demonstrates that using superpages can reduce TLB misses significantly.
The SPEC benchmarks and mp3d see an order of magnitude reduction in the number of TLB misses. Not shown in this table is that the improvement comes from a few large mappings, since only 10%-20% of misses were to superpages. The ML and coral results show that superpages are effective with very large data sets too. The data also show that the operating system can implement a good page-size assignment policy; we did not modify the applications.
Fftpde and pthor, however, show little improvement due to sparse access patterns.
The third column demonstrates that partial-subblock TLBs usually perform significantly better than superpage TLBs for reasons given in Section 3. However, subblock TLBs can have more TLB misses than super- ... TLBs. This shows that the operating system was very successful at allocating physical memory to support partial-subblock TLBs. Copy-on-write situations, which Foxtrot does not optimize, account for most of the difference between the performance of complete- and partial-subblock TLBs. Thus one can choose between the large TLB size for complete-subblocking and the operating system support for partial-subblocking. We have not yet come up with a convincing explanation for why fftpde (flagged with asterisks) performs slightly worse with complete-subblocking than with partial-subblocking.
The brute force method of increasing TLB reach is to build a much larger TLB that supports only a single page size. The key advantage of this approach is that no operating system changes are needed. The disadvantage is that the larger TLB may have an unacceptably large access time and/or chip area. The final column explores this possibility with a 256-entry TLB that uses four-way set-associativity instead of a fully-associative design. Results show that the larger TLB suffers more misses than a 64-entry fully-associative partial-subblock TLB. However, in the absence of operating system support or in the presence of excessive external fragmentation, superpage and partial-subblock TLBs degenerate to a single-page-size TLB. Under these conditions set-associative single-page-size or complete-subblock TLBs should be favored.
The data, so far, assume 64-entry fully-associative TLBs with a superpage size of 64KB or partial-subblocking with subblock factor of 16. Results of sensitivity analysis (not included here) show that varying the superpage size from 16KB to 64KB, the subblock factor from 2 to 16, and TLB size from 32 to 256 entries does not qualitatively alter the conclusions [Tall94].
Here we size TLBs to get a comparable number of TLB misses to see which TLB minimizes chip area [Joup94, Nagl94]. We estimate the chip area required to implement a single-ported TLB using the on-chip cache area model proposed by Mulder et al. [Muld91] with the assumptions given in the Appendix. Table 7 gives the number of single-page-size, partial- and complete-subblock TLB entries required to get a comparable number of misses to a 64-entry superpage TLB and the corresponding area normalized with respect to the area for the 64-entry superpage TLB. We obtained these numbers by iteratively rerunning our simulation, varying the TLB size until the TLB miss counts were comparable.
This analysis ignores, however, that operating system overheads and TLB miss penalties can differ significantly. 
Appendix: Area Model Assumptions
We made the following assumptions while using the area model suggested for on-chip fully-associative caches by Mulder et al. [Muld91].
The units are register bit equivalents (rbe).
\[
\mathrm{Area}_{fac} = \mathrm{PLA} + \mathrm{RAM} + \mathrm{CAM}
= 130 + 0.6\,(\#\text{entries} + 6)\,\bigl((\#\text{data bits} + \#\text{status bits}) + 6\bigr) + 0.6\,(\sqrt{2}\,\#\text{entries} + 6)\,(\sqrt{2}\,\#\text{tag bits} + 6)
\]
The tag bits include a 12-bit PID and a 52-bit VPN (64-bit virtual address minus the 12-bit base page offset). In subblock TLBs the VPN is log2(subblock factor) bits smaller.
The data bits include a 36-bit PPN (48-bit physical address minus the 12-bit base page offset) and 8 bits of attributes. They also include the modified and valid bits that are one bit each. In partial-subblock TLBs we count the valid bits as tag bits, though they are not true CAM cells. Partial-subblock TLBs have one additional attribute bit (SB).
There is one status bit per TLB entry: the used bit (for LRU replacement).
For superpage TLBs, we assume a 4-bit size field in CAM. In implementations, the size field is neither completely in CAM nor completely in RAM. It functions as a mask in tag compare and controls physical address generation too.
We use the model's assumptions "as is" about the size of drivers (6), sense amps (6), PLA (130), RAM cells (0.6 rbe), CAM cells (1.2 rbe) and CAM aspect ratio (1:1).
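For readers who want to experiment with the model, the sketch below evaluates the area formula for the four TLB types. The function names are ours, and the per-type bit counts are our reading of the assumptions listed above (in particular, treating the two factors in the CAM term as the square root of 2, which is what the 1.2 rbe CAM cell and 1:1 aspect ratio imply). It is not the code used to produce Table 7.

```python
import math

PID, VPN, PPN, ATTR = 12, 52, 36, 8      # bit widths from the assumptions above

def area_fac(entries: int, data_bits: int, status_bits: int, tag_bits: int) -> float:
    """Area in register bit equivalents (rbe) of a fully-associative structure."""
    pla = 130
    ram = 0.6 * (entries + 6) * ((data_bits + status_bits) + 6)
    cam = 0.6 * (math.sqrt(2) * entries + 6) * (math.sqrt(2) * tag_bits + 6)
    return pla + ram + cam

def log2_int(n: int) -> int:
    return n.bit_length() - 1            # exact for powers of two

def single_page_size(entries: int) -> float:
    # data = PPN + attributes + modified + valid; status = used bit
    return area_fac(entries, PPN + ATTR + 2, 1, PID + VPN)

def superpage(entries: int) -> float:
    return area_fac(entries, PPN + ATTR + 2, 1, PID + VPN + 4)   # 4-bit size field in CAM

def complete_subblock(entries: int, n: int) -> float:
    # n copies of the data portion, tag shortened by log2(n) bits
    return area_fac(entries, n * (PPN + ATTR + 2), 1, PID + VPN - log2_int(n))

def partial_subblock(entries: int, n: int) -> float:
    # One shared PPN/attribute pair plus the SB bit, n modified bits in RAM,
    # and n valid bits counted as tag bits.
    return area_fac(entries, PPN + ATTR + 1 + n, 1, PID + VPN - log2_int(n) + n)

for name, area in (("single-page-size", single_page_size(64)),
                   ("superpage", superpage(64)),
                   ("partial-subblock", partial_subblock(64, 16)),
                   ("complete-subblock", complete_subblock(64, 16))):
    print(f"{name}: {area:.0f} rbe")
```

Dividing each result by the superpage figure gives normalized areas in the style of Table 7, though the entry counts used there differ per TLB type.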
