In the design of SPUR, a high-performance multiprocessor workstation, the need for large "snooping" caches suggests a new approach to virtual address translation. By performing this translation in each processor's virtual cache, the need for separate translation lookaside buffers is eliminated. Trace-driven simulations show that normal cache behavior is only minimally affected, and that unless an extremely large and complex TLB were built, using a separate device would actually reduce system performance.
1. Introduction
As early as the 1960's, computer systems were providing users with the abstraction known as virtual memory. First appearing in the Atlas computer [Foth61], virtual memory eliminated the need for program overlays by automatically transferring data to and from "backing store". Programmers were given the illusion of a much larger address space and programs were now independent of the size of memory. The authors of MULTICS called virtual memory "generalized addressing" [Dale68] and demonstrated that the use of separate address spaces could provide protection and sharing in a controlled fashion.
Machines lacking virtual memory are considered to be severely limited. The migration from physical to virtual memory has taken place in computers at all levels. In mainframes we saw the progression from IBM S/360 to S/370 [Case78], in minicomputers from the DEC PDP-11 to VAX-11 [Stre78], and among workstations from the Xerox Alto to the Dorado [Pier83]. In each case, virtual memory was cited as one chief feature of the new architecture. As computer architects consider the structure of future systems, they must face the issue of how to support virtual memory.
Using multiple processors to obtain higher performance from single-user workstations is one area of recent research in computer architecture. This is one topic being explored by the SPUR (Symbolic Processing Using RISCs) project at Berkeley. Machines of this type are likely to be characterized by faster processors and more physical memory than was previously feasible. Large multiple virtual address spaces will be provided for protection and sharing among programs. Lastly, the multiprocessor nature of these systems dictates the need for the consistency of shared data, including address translation information. This paper examines some of the existing means of hardware support for the translation of virtual to physical addresses. These methods are then evaluated in light of the SPUR multiprocessor RISC project. The proposal for SPUR uses the existing caches as translation buffers, doing away with the need to build a separate device. It also solves the problem of data consistency for translation information. Trace-driven simulations show that the cost of a separate translation buffer cannot be justified. These results suggest that unless a large and complex TLB were built, this separate device would actually reduce performance.
Although the motivation for this study arises from multiprocessors, the results are equally valid for uniprocessors. In addition, translation mechanisms are typically understood and evaluated for a single processor. Therefore a uniprocessor model will be assumed unless the effects of multiprocessing are relevant.
Brief Survey of Existing Translation Mechanisms
To provide for virtual addressing, memory is divided into fixed-size blocks called pages that can then be relocated both in primary and disk storage. Virtual addresses used by programs must be translated, or mapped, into physical addresses before memory may be accessed. By some combination of hardware and software lookup, the translation process converts the virtual page number into an address in physical memory. The offset of a particular byte within a page remains unaltered.
Translation normally consists of some form of table lookup using the virtual page number to index into a page table maintained by the operating system. An individual page table entry (PTE) contains the physical location of the page plus any associated status or protection bits. This lookup could be done entirely in software, but to improve performance, some form of hardware support is usually provided. The subsections below describe some typical methods.
The Original Translation Scheme
Atlas, the first virtual memory computer, employed a simple translation scheme. A register was associated with each of the 32 pages of core memory. Each register contained the virtual page number of the page stored in that physical page frame. When a reference to memory was made, the virtual page number was compared with each of these page registers in parallel. If there was a match, the word from the appropriate page was sent back to the processor; otherwise, the supervisor was invoked to bring the required page in from drum storage. Given the size of typical memories today, this fully-associative lookup would be prohibitively expensive. A modern memory may contain 256 megabytes of storage, and even with a 4 kilobyte page size the hardware would have to compare over 64,000 addresses in parallel.
A Fully Resident Memory Map
Perhaps the most intuitive means of hardware support is to keep a page table that maps all virtual pages to the corresponding physical page. Figure 2.2 shows how the page table for a single address space may be kept fully resident in a specially-dedicated "Mapping RAM". The virtual page number addresses the entry in the RAM that contains the physical page number and flag bits for the page.
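The page-register scheme and the cost estimate above can be sketched as follows. This is a minimal illustration, not the Atlas hardware: the function names and the sample register contents are hypothetical, and the sequential loop stands in for a comparison that the hardware performs on all registers in parallel.

```python
# Sketch of an Atlas-style page-register lookup (all names are hypothetical).
PAGE_BYTES = 4 * 1024              # 4 kilobyte page, as in the text
MEMORY_BYTES = 256 * 1024 * 1024   # 256 megabyte memory, as in the text


def page_registers_needed(memory_bytes, page_bytes):
    """One page register (and one comparator) per physical page frame."""
    return memory_bytes // page_bytes


def associative_lookup(page_registers, virtual_page):
    """Return the frame whose register holds virtual_page, or None on a fault.

    Hardware compares every register at once; this loop is sequential.
    """
    for frame, resident_page in enumerate(page_registers):
        if resident_page == virtual_page:
            return frame
    return None  # the supervisor would fetch the page from backing store


# Atlas had 32 page registers; frame 5 holds virtual page 1234 in this example.
atlas_registers = [None] * 32
atlas_registers[5] = 1234
```

With these figures, `page_registers_needed(MEMORY_BYTES, PAGE_BYTES)` gives the number of parallel comparisons a modern memory would need under this scheme.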
This approach is used by the Xerox Dorado [Clar81] and is feasible because only one (256M byte) virtual address space is supported. However, with multiple large address spaces, the amount of high-speed RAM required becomes impractical. Since the Dorado virtual page number is 18 bits, the map must have entries for the 256K 1K-byte pages. Physical page numbers are 14 bits, so with the flags this requires 17 256K RAM chips: a little over 128K bytes. If the virtual address space were 32 bits, even with a larger 4K byte page, over a million mapping entries would be needed. With more physical memory, the amount of the mapping RAM required can easily exceed the size of the Dorado's original 8 megabyte main memory.
This approach is simple but wasteful. Address spaces are typically used sparsely: often only the first and last kilobyte. Furthermore, it is well known that programs tend to exhibit locality of reference [Denn72]. There is no need to keep the entire mapping resident.
Figure caption: With a single, small address space, it is feasible to keep all of the page mapping in special memory. One RAM access is then sufficient to perform translation.
Another limiting factor with this design lies in supporting multiple address spaces. With only one space, entries in the map are changed infrequently. With several different address spaces, however, either copies of the mapping hardware must be provided for each space, or all the entries must be invalidated when context switches occur. Context switches typically happen on every interrupt, and these occur about 10 to 100 times a second. Invalidating and rewriting roughly one megabyte of RAM this frequently would create severe performance problems.
Two-level Implementation of a Sparse Mapping
In a machine with multiple address spaces, only one mapping is valid at a time. These multiple address spaces may be considered to be subspaces of one larger global address space. Figure 2.3 shows how this is done by extending the virtual address with a context register to identify the current subspace. Since switching occurs between a few active contexts and only a portion of the address space for each is used, a sparse mapping of this global address space will suffice. This is achieved by splitting the one RAM in the previous method into separate segment and page maps. The SUN Workstation employs this type of mechanism [Bech82] and allows for at most 8 loaded contexts. This method requires far less RAM than if the entire space were kept resident. For example, the SUN 2.0 uses 2K bytes in the Segment RAM and 8K bytes in the Page RAM to translate the more frequently used portions of multiple address spaces that would require 256K bytes in the previous method. The operating system loads as many of the page table entries for the currently active contexts (processes) as possible. Although many different contexts exist at any point in time, in practice only a few are highly active. When switching between these few active contexts, no invalidation of the map is necessary. Only when a new context becomes active, or an old context gets remapped, must entries be rewritten.
There are several drawbacks to this technique. It now requires two RAM accesses as compared to one in the previous method. Since the "page" field from the virtual address is appended to the segment number to index into Page RAM, pages cannot be mapped individually. SUN uses a four bit field that forces mapping to be in contiguous blocks of 16 pages. Finally, the operating system must successfully predict what portions of address spaces will be the most active to get the best performance.
[...] bit of the virtual address being appended to index into the translation buffer. If this bit distinguishes between separate system and user regions of the address space, then the buffer is effectively divided into these two regions as well. The advantage to this is that on context switch, only the user-specific half of the buffer needs to be invalidated. The switch to a new user's virtual address space requires that the old mapping be purged, while the shared system space remains unchanged. The VAX-11/780 does precisely this [DEC81].
Demand Caching of Page
The ELXSI 6400 has extended the notion of separate user and system divisions of the TLB by providing sixteen copies of the hardware. At any given time, one copy is designated to provide the current user mapping. The switch between user contexts can thus be performed with little invalidation by selecting the appropriate user copy. This is similar to the use of the context register in the SUN two-level mapping scheme.
The SPUR multiprocessor, shown in Figure 3.1, will consist of about ten processor-cache pairs on a common bus to shared memory. Two custom VLSI chips form the basis of each processor-cache pair. The first is a 32-bit Reduced Instruction Set Computer (RISC) [Patt85], and the second is a cache controller and bus interface chip. Also associated with each pair are a custom VLSI floating point co-processor and a collection of RAM and buffer chips (not shown). The shared global memory can be addressed up to 4 gigabytes.
Figure caption: Each processor is a RISC with an on-chip instruction buffer (IB). Large virtual caches provide high speed access to instructions and data, greatly reduce bus traffic, and maintain consistency by "snooping" on bus transactions. Shared memory and I/O are accessed through a common system bus.
The RISC processor is a tagged architecture with an instruction set tailored to execute LISP. It has a cycle time of about 150 nanoseconds. In the tradition of RISC, the Execution Unit contains a large register file, and control of the 32-bit datapath is handled by a four-stage pipeline [Kate83]. It is the task of the Instruction Unit to prefetch and buffer instructions in an effort to deliver one per cycle to the processor.
The caches in the system not only provide much faster access time than main memory, but also significantly reduce the total bus cycles needed for execution [Good83]. Without the caches, contention for the bus would limit the number of effective processors to just a few. Our studies have led us to specify each cache to be 128 kilobytes, direct-mapped, with transfers between memory and the cache handled in four-word (32 byte) blocks [Katz85a].
In any system with multiple caches, the problem of cache consistency arises. When a processor changes a block in its cache, other processors must not be allowed to read a "stale" copy residing in their own cache. To prevent this inconsistency, when a cache write occurs, copies of the same block in the other caches must be either updated or invalidated.
The caches in the SPUR system use a distributed ownership protocol to maintain consistency of shared data [Katz85b]. A block of memory may be contained in multiple caches for reading, but only one cache may "own" it for writing. Initially, all blocks are owned by memory. When a processor writes to a block not owned by its cache, the block's current owner relinquishes ownership and places the block on the bus. Any other caches with copies of that entry must invalidate it. To do this, the cache controller not only heeds processor requests, but is dual-ported to monitor bus requests: it "snoops" on the bus. By using a write-back policy, instead of write-through, bus traffic is even further reduced by only updating memory when a modified block must be flushed from the cache.
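The ownership rules described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the protocol of [Katz85b]: the class and method names are hypothetical, each block holds a single value, and the `exclusive` set (which records whether an owner knows that no other cache holds a copy) is an assumption added here so that a second write by the same owner still invalidates readers.

```python
# Toy model of an ownership-based, write-back consistency scheme.
# Names and the "exclusive" bookkeeping are illustrative assumptions.
class Cache:
    def __init__(self):
        self.data = {}          # block address -> contents
        self.owned = set()      # blocks this cache owns for writing
        self.exclusive = set()  # owned blocks with no other cached copies


class Bus:
    def __init__(self, n_caches):
        self.memory = {}        # memory owns every block initially
        self.memory_writes = 0  # counts write-backs to memory
        self.caches = [Cache() for _ in range(n_caches)]

    def _supply(self, addr):
        """The current owner (a cache or memory) places the block on the bus."""
        owner = next((c for c in self.caches if addr in c.owned), None)
        if owner is None:
            return self.memory.get(addr, 0)
        owner.exclusive.discard(addr)  # some other cache now holds a copy
        return owner.data[addr]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache.data:
            cache.data[addr] = self._supply(addr)
        return cache.data[addr]

    def write(self, cpu, addr, value):
        cache = self.caches[cpu]
        if addr not in cache.exclusive:
            if addr not in cache.data:
                cache.data[addr] = self._supply(addr)
            for other in self.caches:  # snooping caches invalidate their copies
                if other is not cache:
                    other.data.pop(addr, None)
                    other.owned.discard(addr)
                    other.exclusive.discard(addr)
            cache.owned.add(addr)
            cache.exclusive.add(addr)
        cache.data[addr] = value

    def flush(self, cpu, addr):
        """Write-back: memory is updated only when an owned block is flushed."""
        cache = self.caches[cpu]
        if addr in cache.owned:
            self.memory[addr] = cache.data[addr]
            self.memory_writes += 1
        cache.data.pop(addr, None)
        cache.owned.discard(addr)
        cache.exclusive.discard(addr)
```

In this model a write followed by a read from another processor is served cache-to-cache, and memory is written only by `flush`, which mirrors the reduction in bus traffic that the write-back policy is meant to provide.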
To provide minimal memory access time, SPUR uses virtual address caches: the caches are referenced directly by the processor with virtual addresses. Because of this, address translation is only required after a cache miss when a transaction to memory must be initiated. If the cache were addressed with physical addresses, the translation would have to occur on every processor reference. To retain reasonable effective memory access times, the translation mechanism would have to be extremely fast. Since cache misses only account for a small percentage of the total references, often around one or two percent, translation in SPUR does not have to be this fast to still yield high performance.
One appealing possibility is to design a system with virtual addresses on the bus. Translation would occur only when absolutely needed: at the memory. In an effort to minimize prototype design time, we decided not to modify the bus or memory devices, and this possibility was ruled out. The SPUR multiprocessor therefore performs translations at each processor, but only if the cache misses. See Appendix A for a more complete discussion of the considerations in locating address translation.
Two complications arise because of the decision to use virtual address caches. The first is the danger of synonyms: two virtual addresses that refer to the same physical location. If this were allowed, it would be possible for two or more entries to appear in the cache for the same location in memory. This would complicate the task of maintaining cache consistency. In SPUR, synonyms are prohibited by the operating system.
The second problem is the need for reverse translation: mapping a physical address back to a virtual address. This is often done as a solution to the synonym problem, but arises in SPUR because of the need for the cache to snoop on the bus. Since the bus must transmit physical locations to memory, and the cache is referenced by virtual addresses, a reverse translation would have to be done. SPUR eliminates the need for reverse translations by transmitting both physical and virtual addresses over the bus. Caches snoop on the virtual address, and memory uses the physical address.
Figure caption: The SPUR virtual memory allows for multiple large address spaces by providing one large global virtual address space. Each process's virtual address space is divided into four segments: stack, heap, code, and system. The global virtual address is formed in the cache by appending the segment number from one of the four active segment registers corresponding to these divisions.
Active Segment Registers
The SPUR virtual memory model supports multiple address spaces by extending the processor's virtual address to a larger global virtual address. This global space is divided into 256 1-gigabyte segments, each mapped independently. As Figure 3.2 shows, the top two bits of the processor virtual address are used to select one of four segments that are designated active. Thus, a process's virtual space is composed of four gigabytes divided into stack, heap, code, and system space. Processes share segments on an "all-or-nothing" basis: if any portion of a segment is shared by two processes, the whole segment must be shared.
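A small sketch of how a global virtual address could be formed from these rules follows. The field widths are derived from the text (256 segments gives an 8-bit segment number, 1-gigabyte segments give a 30-bit offset), but the function name, the register list, and the order in which the four segments are assigned to selector values are assumptions made for illustration.

```python
# Sketch: forming a global virtual address from a 32-bit processor address.
SEGMENT_NUMBER_BITS = 8    # 256 segments in the global space
SEGMENT_OFFSET_BITS = 30   # each segment is 1 gigabyte


def global_virtual_address(processor_va, active_segments):
    """Replace the top two address bits with an 8-bit active segment number.

    active_segments is a list of four segment numbers; which slot holds
    stack, heap, code, or system is not specified here.
    """
    selector = (processor_va >> SEGMENT_OFFSET_BITS) & 0x3
    offset = processor_va & ((1 << SEGMENT_OFFSET_BITS) - 1)
    segment = active_segments[selector]
    if not 0 <= segment < (1 << SEGMENT_NUMBER_BITS):
        raise ValueError("segment number must fit in 8 bits")
    return (segment << SEGMENT_OFFSET_BITS) | offset


# Hypothetical contents of the four active segment registers.
example_registers = [17, 3, 200, 255]
```

Under these widths the global virtual address is 38 bits wide, so two processes that load the same segment number into an active register see the same global addresses for that whole segment.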
Each segment is divided into 256K 4K byte pages. This implies that to map the entire global address space, over 64 million page table entries would have to be kept. Since each PTE is one word, this would require 256 megabytes of memory. By adopting a two-level page table structure, the page tables may themselves be written out to disk. Each of the "meta" page table entries, or root PTEs, maps one page of PTEs, for a total mapping of 4 megabytes. Thus, these Root Page Tables require only 256K bytes to map the entire global address space, and are kept resident in memory. Figure 3.3 shows the two-level mapping structure for one segment.
Figure caption: Associated with the number of the active segment is the base address of the root page table for that segment. The high-order eight bits of the virtual page number index into the root page table to find the base of the appropriate page table. The low-order ten bits of the virtual page number select the page table entry for the desired page. The offset field then specifies the byte within the page. See Figure 3.5 for an explanation of how these addresses are formed in the cache controller.
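The sizes quoted for the one-level and two-level page tables can be checked with a short calculation. The constant and function names below are hypothetical; the 4-byte PTE follows from "each PTE is one word" on a 32-bit machine, and the 8-bit/10-bit split of the 18-bit virtual page number follows the description of the root page table index.

```python
# Worked arithmetic for the SPUR page table sizes (names are illustrative).
SEGMENTS = 256
PAGES_PER_SEGMENT = 256 * 1024
PAGE_BYTES = 4 * 1024
PTE_BYTES = 4  # one 32-bit word per PTE

total_ptes = SEGMENTS * PAGES_PER_SEGMENT          # PTEs for the whole global space
page_table_bytes = total_ptes * PTE_BYTES          # size of a flat, one-level map
ptes_per_page = PAGE_BYTES // PTE_BYTES            # PTEs held in one page
bytes_mapped_per_root_pte = ptes_per_page * PAGE_BYTES
root_ptes = total_ptes // ptes_per_page            # one root PTE per page of PTEs
root_table_bytes = root_ptes * PTE_BYTES           # resident root page tables


def split_virtual_page_number(vpn):
    """Split an 18-bit virtual page number into (root index, PTE index)."""
    root_index = (vpn >> 10) & 0xFF   # high-order eight bits
    pte_index = vpn & 0x3FF           # low-order ten bits
    return root_index, pte_index
```

The point of the two levels shows up in the last two figures: the resident root tables are three orders of magnitude smaller than the flat map they replace.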
The In-Cache Translation Process
Since translation is to be done at each processor, the same issue of data consistency that plagued multiple caches arises here as well. A translation buffer is really nothing more than a cache for page table entries, and snooping caches are designed to solve these consistency problems. Why not just use the existing caches to provide the translation? Aside from saving physical storage, there is now a second reason for the two-level structure that places the page tables in the virtual address space: PTEs must have virtual addresses in order to be cacheable. It has already been determined that the caches must be large to keep bus contention to an acceptable level, so it appears unlikely that the number of entries required for translation will cause much pollution. For example, the 128K byte cache holds 4,096 blocks. If only 32 of these happened to contain PTEs, this would be enough to map one megabyte of memory.
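The example figures above follow from the cache and page parameters given earlier. The sketch below reproduces them; the names are hypothetical, and the 4-byte PTE is again taken from the statement that each PTE is one word.

```python
# How much memory a handful of cached PTE blocks can map (illustrative names).
CACHE_BYTES = 128 * 1024
BLOCK_BYTES = 32
PTE_BYTES = 4
PAGE_BYTES = 4 * 1024


def cache_blocks():
    """Number of blocks in the cache."""
    return CACHE_BYTES // BLOCK_BYTES


def bytes_mapped(pte_blocks):
    """Memory mapped when pte_blocks cache blocks are full of PTEs."""
    ptes_per_block = BLOCK_BYTES // PTE_BYTES   # 8 PTEs fit in one block
    return pte_blocks * ptes_per_block * PAGE_BYTES
```

Here 32 blocks are under one percent of the cache, which is the basis for expecting little pollution from cached PTEs.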
In SPUR the task of address translation is entirely the responsibility of the cache-controller chip. This device is a custom VLSI circuit and already requires a complex control for snooping and other operations like the write-back of dirty blocks and selective invalidation. The addition of control for address translation therefore represents only a small additional complication.
As shown before in Figure 3.2, the cache is referenced with the global virtual address formed by concatenating the selected active segment number to the virtual address supplied by the processor. Figure 3.4 shows the four cases of operations that may be done to complete a memory reference.
In the most frequent case, the cache hits (A), and data is delivered in only one cycle. In preparation for a miss, a concatenate-and-extract circuit in the cache controller forms the virtual address of the page table entry during the reference. If translation is required, the cache controller uses this address and attempts to read the page table entry in the following cycle. Figure 3.5 shows how this address, and others in the translation process, are formed.
Case B corresponds to a miss in the cache, and a hit in the "TLB". This requires one additional cache reference for the PTE, and one memory transfer to fetch the desired data block. If the PTE is not cached (C), the third cache reference is for the root PTE, from which the address of the PTE in memory may be formed. After the PTE is fetched from memory, it is loaded into the cache for use in future translations. In the worst case (D), all three cache references fail, and the root PTE must also be fetched from memory and cached. These root page tables can always be found at physical locations associated with each of the four active segment numbers.
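The four cases can be traced with a small model of the lookup sequence. This is a sketch under several assumptions: the cache is a plain set with no replacement or write-back, entries are tracked per PTE rather than per 32-byte block, the tuple keys stand in for the virtual addresses that the cache controller would actually form, and the 1-cycle and 13-cycle costs are the cache and memory costs used elsewhere in this paper.

```python
# Sketch of the in-cache translation sequence (cases A to D).
CACHE_CYCLES = 1     # one cache reference
MEMORY_CYCLES = 13   # one bus/memory transaction


def reference(cache, segment, vpn, block):
    """Perform one processor reference; return (case letter, cycles used)."""
    data_key = ("data", segment, vpn, block)
    pte_key = ("pte", segment, vpn)
    root_key = ("root", segment, vpn >> 10)   # one root PTE per 1024 pages

    cycles = CACHE_CYCLES                     # first reference: the data itself
    if data_key in cache:
        return "A", cycles

    cycles += CACHE_CYCLES                    # second reference: the PTE
    if pte_key in cache:
        case = "B"
    else:
        cycles += CACHE_CYCLES                # third reference: the root PTE
        if root_key in cache:
            case = "C"
        else:
            case = "D"
            cycles += MEMORY_CYCLES           # fetch the root PTE from memory
            cache.add(root_key)
        cycles += MEMORY_CYCLES               # fetch the PTE from memory
        cache.add(pte_key)

    cycles += MEMORY_CYCLES                   # fetch the requested data block
    cache.add(data_key)
    return case, cycles
```

Starting from an empty set, the first reference to a page takes the worst-case path, a later reference to another block of the same page needs only the cached PTE, and a reference to a neighbouring page reuses the cached root PTE.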
More than three memory operations may occur if a write-back of a cache block occurs. When a block from memory replaces a block in the cache that has been modified, this dirty block must be written back to memory. To perform the write without recursively needing another translation, the physical tag for each block is kept in cache tag-memory.
On examining any page table entry, the desired page may be shown to be invalid, indicating that the page is not in memory but resides on disk. In this event, a trap to the page fault handler is taken. The remaining bits of the PTE are used as an index into a table managed by the operating system that contains the disk addresses of the pages in secondary storage.
In most systems, reference and dirty bits for each page are kept in the PTE to handle the replacement and write-back of pages in memory. The SPUR system does not support true reference bits, but instead an approximation we refer to as the miss bit. A true reference bit would require bringing the corresponding page table entry into the cache for every reference to the cache. Instead, the miss bit is set only when a reference to a cache block misses. In this event, the PTE must be brought into the cache anyway to carry out the address translation. The operating system can periodically reset these bits at intervals observed to provide the best performance. The algorithm used for replacing pages in memory is therefore an approximation to a true "Least Recently Used" policy. In a similar attempt to limit the amount of writing to PTEs, the dirty bit is set not on every write to a block, but only when a cache requests write-ownership. A cache cannot write to a block until it has requested write-ownership, therefore every initial write by a cache will ensure that the dirty bit is set. Again, one chief advantage to this method of translation is that all information in the page table entries is kept consistent across the multiple processors.
Figure caption: This figure shows how addresses are formed for the worst case scenario in the preceding figure. Up to three virtual addresses are formed to reference the cache for the requested data, page table entry, and root page table entry, respectively. At most, all three corresponding physical addresses must be formed and a separate bus-memory transaction is performed to fetch each block. The dashed lines divide the same four cases that were presented in Figure 3.4. The active segment number and page table base registers are contained in the cache controller.
4. Simulation of In-Cache Translation
4.1. Methodology
To evaluate the performance of the SPUR in-cache translation mechanism and other translation buffers, the Dineroii cache simulator [Hill84] was used. This simulator is address-trace driven and reports miss rates and bus traffic for specified cache parameters. With modifications for the caching of page table entries, these simulations provided information about the cost of in-cache translation, and how it compares to using a separate translation buffer. Table 4.1 shows the five address traces used to drive the simulations and the amount of virtual memory that each references. The first four were gathered on a VAX running UNIX with an address and instruction tracer [Henr84]. LISZT is the Franz LISP compiler compiling itself. VAXIMA is an algebraic manipulator written in LISP performing a series of integrations, matrix operations, and solving differential equations. CS20K and CS100K are traces composed of two separate sections of the VAXIMA trace and designed to simulate context switching. They are identical except for the switching interval, which is 20,000 and 100,000 references, respectively. MVS is a series of calls to this operating system and was traced on an Amdahl 470 [Smit85]. This last trace references a much larger range of virtual memory.
Address Traces Used
Although the VAX instruction set is different from that of a RISC architecture, the only RISC traces available were of small program compilations that exhibited optimistic cache performance. The results of a study of the VAX TLB by Clark and Emer [Clar85] show miss rates higher than those produced by the RISC compilation trace, but not as high as those of the VAXIMA trace (see Appendix B). SPUR is designed to be a symbolic machine, and VAXIMA is written in LISP, so this trace is more typical of programs that will be run. The behavior of VAXIMA is therefore a conservative estimate of TLB performance.
Measurements of timesharing systems like the VAX [Emer~-:.~1 show context switches occurring about every 6500 instructions. This corresponds to roughly every 20,000 references including both instructions and data. There is less experience with single-user machines, but their interrupt rates should be dominated by the pace of one person interacting with the workstation. This would suggest context switch rates of about 10 to 100 a second. With a cycle time approaching 100 nanoseconds, the 100 interrupts per second would result in switching every 100,000 references.
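The switching intervals used for the two context-switching traces follow from simple rate arithmetic, sketched below. The function name is hypothetical, and the default of one memory reference per cycle is an assumption implied by equating 100,000 cycles with 100,000 references.

```python
# Rate arithmetic behind the 20,000 and 100,000 reference switching intervals.
def references_between_switches(cycle_time_ns, switches_per_second,
                                references_per_cycle=1.0):
    """References issued between two context switches."""
    cycles_per_second = 1e9 / cycle_time_ns
    return cycles_per_second * references_per_cycle / switches_per_second


# Timesharing case: 20,000 references per 6,500 instructions.
references_per_instruction = 20000 / 6500
```

For a workstation with a 100 nanosecond cycle and 100 switches per second this gives 100,000 references per interval; at 10 switches per second the interval would be ten times longer, so the 100,000 figure is the more demanding end of the range.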
The traces CS20K and CS100K were used to simulate context switching. They interleave two different VAXIMA traces at intervals of 20,000 and 100,000 references, respectively. Rather than flushing the cache on a context switch, the simulation places the two reference streams in separate address spaces. The SPUR caches will not need to be flushed on context switch. Although blocks from different contexts can displace each other in the cache, the use of segment numbers will ensure different virtual addresses and will therefore not cause false hits. On a multiprocessor designed for one user, the number of separate processes active on one processor is likely to be small. Hence, only the two streams are interleaved.
The instruction and address tracer used on the VAX is capable of measuring user processes only. There is therefore nothing to show system performance, and the MVS trace was acquired for this purpose. This particular section of the trace shows extremely poor locality and, as Table 4.1 shows, references a much larger range of virtual memory than the other traces. There is a high frequency of MVS memory references whose addresses agree in the low-order bits, causing them to index to the same entries in a cache. MVS therefore shows an unusually high rate of cache collisions: cases where one reference "bumps out" an older block being stored. Over 12.5% of all references are to just the first 32 byte block. Collisions here account for 15.3% of all misses when simulating a direct-mapped cache.
MVS, then, provides a solid upper bound on miss rates, and accentuates the characteristics of the cache when varying parameters (see Appendix C).
All five traces contain one million references. Although this represents under one second of execution, this length was necessary given the available resources. Several longer traces of five million references were run and miss rates agreed to within one-hundredth of a percent.
Performance of In-Cache Translation
There are two opposing views of the SPUR cache/translation buffer: a cache being corrupted by page table entries, or a translation buffer being polluted by instructions and data. Table 4.2 shows the result: an increase in cache miss rate because both functions are being performed in the cache. This total additional miss rate is computed by dividing the misses added when PTEs are cached by the total number of references made to the cache by the processor.
There is an important distinction to be made: processor references to the cache are for instructions and data, while the cache refers to itself for page table entries in the translation process. The last two columns of Table 4.2 separate the total additional miss rate according to this distinction. In the column labelled "Collisions", the processor is experiencing additional misses on instructions and data because normal cache contents are being displaced by PTEs. This is strictly a cost of performing translation in the cache. The "PTE Misses" column, on the other hand, reflects the additional misses incurred only when the cache is being referenced for the translation process. This is the measure of the performance of the translation mechanism: the "TLB" miss rate for SPUR. This measure includes not only the misses on page table references, but root page tables as well. As we shall see, there are few root page table references.
Table 4.3 displays the percent of memory references handled by each of the four cases shown in Figure 3.4. Between 90 and 99% of the time, the cache hits, and the reference is handled in one cycle (A). The SPUR in-cache translation takes over on a miss, and in the next cycle references itself with the virtual address of the page table entry. From 0.5% to 7% of all references are cache hits on these references (B). The desired instruction or data can then be fetched from memory in one bus transaction.
For all the traces except MVS, only 2 or 3 references out of 10,000 miss in the "TLB" and thus go to memory for the page table entry (C). Only about 2 in 100,000 references take a "double-miss," and require a memory fetch of the root page table entry (D).
Table caption (fragment): [...] Figure 3.4. The average number of cycles per reference is given for each trace. A "$" represents a cache reference taking 1 cycle, and an "M" indicates a memory transaction requiring 13 cycles.
Table 4.4 shows how the SPUR in-cache translation compares to the commercial translation buffers that were presented in Table 2.1. The SPUR method of translation displays consistently lower miss rates for all but one case (LISZT on the 470 V/8). The more poorly-behaved the trace, the better the in-cache method does when compared to the commercial buffers. By allowing a large, variable number of entries to be dedicated to translation, the SPUR scheme substantially outperforms even the large, set-associative buffers that use hashing.
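The average cycles per reference reported with the four cases can be computed from the "$" and "M" costs as a weighted sum. The sketch below assumes the access patterns for cases A to D that follow from the description of the translation process (one, two, or three cache references and zero to three memory transactions); the dictionary, function names, and the sample fractions in the usage note are hypothetical and are not values from Table 4.3.

```python
# Weighted average cycles per reference from the "$" / "M" notation.
CACHE_CYCLES = 1     # "$": one cache reference
MEMORY_CYCLES = 13   # "M": one memory transaction

# Assumed access patterns for the four cases.
CASE_PATTERNS = {"A": "$", "B": "$$M", "C": "$$$MM", "D": "$$$MMM"}


def pattern_cycles(pattern):
    """Cycles for one pattern string such as '$$M'."""
    return (pattern.count("$") * CACHE_CYCLES
            + pattern.count("M") * MEMORY_CYCLES)


def average_cycles(case_fractions):
    """Average cycles per reference given the fraction of references per case."""
    return sum(fraction * pattern_cycles(CASE_PATTERNS[case])
               for case, fraction in case_fractions.items())
```

For instance, `average_cycles({"A": 0.9, "B": 0.1})` evaluates a made-up trace in which nine references in ten hit the cache and the rest find their PTE cached.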
Comparison to a Separate TLB
A good deal of the high performance displayed by the SPUR system could be because the translation is only being done on cache miss. All the commercial systems shown in the table translate on every reference. The following more direct comparison accounts for the presence of the large virtual cache. Table 4.5 shows the performance of using a separate translation buffer placed after the SPUR cache. These figures were generated by simulating the performance of the 128K byte, direct-mapped cache for each trace, and recording only those addresses that missed. These were then used as the input to each of the simulated translation buffers. To get miss rates as low as those shown in Table 4.2, a separate TLB would require over 256 entries for LISZT, 512 entries for VAXIMA, and over 1024 entries for the context switching traces and MVS. Even if the buffer were built to be 8-way set-associative, to do as well would require over 128, 512, and 1024 entries, respectively.
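The two-stage experiment described above (filter a trace through the cache, then feed only the misses to a translation buffer) can be sketched as follows. This is an illustrative model, not the simulator used for the tables: the function names are hypothetical, addresses are plain integers, and only a direct-mapped TLB indexed by the low-order bits of the page number is modelled, with no set-associativity or hashing.

```python
# Sketch: translate only on cache misses, using a separate direct-mapped TLB.
CACHE_BYTES = 128 * 1024
BLOCK_BYTES = 32
PAGE_BYTES = 4 * 1024


def cache_miss_stream(addresses):
    """Yield the addresses that miss in a 128K byte direct-mapped cache."""
    n_blocks = CACHE_BYTES // BLOCK_BYTES
    tags = [None] * n_blocks
    for address in addresses:
        block = address // BLOCK_BYTES
        index = block % n_blocks
        if tags[index] != block:
            tags[index] = block      # load the block, displacing the old one
            yield address


def tlb_misses(addresses, entries):
    """Count misses in a direct-mapped TLB with the given number of entries."""
    slots = [None] * entries
    misses = 0
    for address in addresses:
        page = address // PAGE_BYTES
        index = page % entries
        if slots[index] != page:
            slots[index] = page
            misses += 1
    return misses


# A tiny made-up trace: two addresses collide in the cache at index 0.
example_trace = [0, 4, 32, 0, 128 * 1024, 0]
example_misses = list(cache_miss_stream(example_trace))
```

Running `tlb_misses(example_misses, entries)` for several values of `entries` shows the kind of sweep over buffer sizes that the comparison in this section relies on.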
The last column in Table 4.5 shows the average number of PTEs contained in the SPUR cache using in-cache translation. Since the separate TLBs are of fixed size, they require over twice the entries to do as well. The direct-mapped TLBs would need over three times as many entries. Although instructions and data can displace page table entries in the in-cache scheme, having a large variable number of entries appears to overshadow this.
Table 4.6 shows the commercial translation buffers examined before in Table 4.4, but this time translating only when the SPUR cache misses. The "Cache Miss" columns have identical entries because the SPUR 128KB cache was simulated in each case. The cycles required for the average reference were calculated as in Table 4.3. The relative cost when compared with the cycles required for SPUR is displayed in the last column for both VAXIMA and MVS.
The SPUR in-cache translation had lower TLB miss rates than all the buffers except for the Amdahl 470 V/8 when running MVS. The V/6 and V/8 both do better than SPUR when taking into account the total machine cycles required. This is because of the high occurrence of PTE misses we observed in
These are the same commercial translation buffers as in Table 4.4. Here, however, they are placed after the SPUR virtual cache to show performance when translating only on cache misses. TLB miss rate and cycles required were lower for the SPUR in-cache translation method for all except the largest TLBs with MVS (shown in bold). As before, only half the entries for the VAX buffers were simulated, and the IBM and Amdahl TLBs use a hashed index.
decrease in miss rate from the VAX 11/780 to the Amdahl 470 V/8. The same is not true for their performance after a virtual cache. Here, the 256-entry VAX buffers do as well as or better than the two directly below them that use hashing. This suggests that hashing is less important for translation after a cache. The number of entries in the device is clearly the dominating factor.
Conclusions
In SPUR, the desire for single-cycle cache access dictates that caches be virtually addressed. Without modifications to the bus and memory, options for performing address translation at main memory are ruled out. This means translation must occur after the cache and before the bus. For the large virtual address space and physical memory of the workstation, a translation buffer provides the most effective use of a "small" amount of mapping memory, and a two-level page table scheme reduces the size of the full map in physical memory. Placing the page tables in the virtual address space allows the page table entries to be cached.
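The lookup sequence implied by this arrangement can be sketched as follows: a reference that misses in the cache probes the cache again for its page table entry, and a PTE miss probes for the root entry that maps the page table itself. This is only an illustration under stated assumptions: the base addresses `PTE_BASE` and `ROOT_BASE`, the 4-byte entry size, and the unbounded cache (no evictions) are hypothetical and are not taken from the SPUR design.

```python
PAGE_BYTES = 4096        # nominal SPUR page size
BLOCK_BYTES = 32         # nominal SPUR cache block size
PTE_BYTES = 4            # assumption: 4-byte page table entries
PTE_BASE = 0xC0000000    # hypothetical virtual base of the second-level page tables
ROOT_BASE = 0xFFC00000   # hypothetical virtual base of the root page table

def pte_vaddr(vaddr):
    """Virtual address of the PTE that maps vaddr."""
    return PTE_BASE + (vaddr // PAGE_BYTES) * PTE_BYTES

def root_pte_vaddr(vaddr):
    """Virtual address of the root entry that maps the page holding that PTE."""
    pte_page = (pte_vaddr(vaddr) - PTE_BASE) // PAGE_BYTES
    return ROOT_BASE + pte_page * PTE_BYTES

def reference(vaddr, cached_blocks):
    """Classify one processor reference and bring the touched blocks into the cache."""
    def probe(addr):
        block = addr // BLOCK_BYTES
        hit = block in cached_blocks
        cached_blocks.add(block)           # idealized cache: blocks are never evicted
        return hit

    if probe(vaddr):
        return "cache hit"                 # no translation needed at all
    if probe(pte_vaddr(vaddr)):
        return "PTE hit"                   # translation served from the cache
    if probe(root_pte_vaddr(vaddr)):
        return "PTE miss"                  # PTE fetched from memory
    return "double miss"                   # root entry fetched from memory as well

cache = set()
outcomes = [reference(a, cache) for a in (0x1000, 0x1004, 0x1020, 0x8000, 0x2000000)]
```

Because one 32-byte block holds eight 4-byte entries under these assumptions, a single PTE fetch also brings in the mappings for seven neighboring pages.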
In-cache translation solves the problem of data consistency between multiple translation devices by using the cache's snooping protocol, thus avoiding additional hardware at the cost of complicating the cache controller. Since the SPUR multiprocessor allows shared data to be cached, the cache controllers must already be complex enough to support a cache-consistency protocol.
Translation buffers have been separate devices for purely historical reasons. Traditionally, translation has occurred before (or in parallel with) referencing the cache. With a virtual cache, there is no reason that the existing hardware cannot be used for both purposes. Since only a small number of cache entries are needed to hold page table entries, the cache performs address translation with minimal effect on its normal performance.
The increase in cache miss rate due to in-cache translation has two components: references by the processor that miss because of added collisions due to PTEs, and references to PTEs by the cache that miss. Of these, PTE misses occur much more often than displacement of instructions and data. When this PTE miss rate is compared against existing TLBs, it outperforms even large, set-associative buffers using hashed indexing. If a separate translation buffer were used, it would need over twice the number of entries held on average by the in-cache method. Even if it were made highly associative, this would still demand 512 or more entries to do as well. A larger, variable number of entries outweighs the cost of the additional cache misses incurred. See Appendix C for the effect of different cache organizations on the SPUR in-cache method of translation.
Studying this form of translation was the result of the particular requirements of the SPUR multiprocessor. These results hold for uniprocessors as well. Perhaps a significant effect of this work will be in the area of low-cost computers. In the past, personal computers have rarely had cache memory because of the cost-performance trade-off. With the decline in memory costs, more small systems will begin to feature larger caches. VLSI now makes it possible to build a circuit with controller and cache tags on-chip. In effect, adding the control for address translation yields a TLB for free.
Acknowledgements:
A number of people working on the SPUR project helped to make this research possible. The Dineroii cache simulator was written by Mark Hill and was later modified by David Wood to measure in-cache translation. George Taylor and Robert Henry were responsible for gathering the VAX traces. Alan Smith provided me with the MVS trace. Professors David Patterson, John Ousterhout, and Randy Katz deserve special note for their advice and patience. Garth Gibson, Susan Eggers, Paul Hansen, Joan Pendleton, and Brent Welch also provided me with ideas and the necessary supportive harassment. In general, my thanks go to all the faculty and graduate students involved in the SPUR research group.
Appendix A Issues in the Location of Address Translation
In deciding how to support virtual memory, the issue arises of precisely where the use of virtual addresses ends and the use of physical addresses begins. Figure A.1 identifies five potential locations for address translation in the SPUR multiprocessor:
(1) Before the Instruction Buffer (IB)
(2) Between the Instruction Buffer and the cache
(3) In parallel with the cache
(4) Between the cache and the system bus
(5) Between the system bus and memory
Placing the translation before the Instruction Buffer (IB) (option (1)) has significant disadvantages. Address translation is required for every reference, reducing the advantage of having the instruction memory on-chip. The IB is likely to be small enough that the low-order bits of the virtual address can be used to directly access the buffer. Recall that the low-order bits represent the byte within a page and therefore are not affected by translation. Since physical addresses are not needed at this point, we cannot justify dedicating processor chip area to translation hardware.
Figure A.1. The potential locations for address translation are shown: (1) before the instruction buffer (IB), (2) between the processor and the cache, (3) in parallel with the cache, (4) between the cache and the bus, and (5) between the bus and memory. Note that for simplicity only one of multiple processor-cache pairs is shown.
Alternatively, the translation can be done between the IB and the cache (option (2)). This is the approach taken in the VAX machines [DEC81], and allows the IB to be accessed without translation. However, data references and IB misses still incur the overhead of translation. Both this and the preceding option require doing the translation in series with the cache access. To keep the time for a cache reference to a minimum, we reject both alternatives.
Rather than doing the translation in series with a cache access, the third alternative is to do it in parallel (option (3)). Such a scheme reduces the cost of a cache reference, potentially to a single RAM cycle, since it is now not the sum, but the maximum, of translation and cache access times. The cache tag memory and the translation buffer are accessed with the virtual address in parallel. The resulting tag and physical page number are then compared to determine if the cache has a hit. This is the approach taken on many of the IBM 370 mainframes [Smit82].
Since the cache is now referenced from the processor with virtual addresses, two complications are introduced: synonyms and the need for reverse translations. Synonyms arise when more than one virtual address refers to the same physical address. This complicates cache consistency because there is no longer a straightforward mapping between virtual and physical addresses. For example, it is possible to have multiple copies of the same memory block stored in a cache with different virtual addresses. Reverse translation occurs when a physical address on the bus must be translated back into a virtual address. This is often done as a solution to the synonym problem, but arises in SPUR because of the need for the cache to snoop on the bus. Since the bus must transmit physical locations to memory, and the cache is referenced by virtual addresses, a reverse translation would have to be performed. Special mechanisms are required to do this mapping, such as reverse translation buffers or a fully associative organization for the cache tags.
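The synonym problem can be made concrete with a toy model. The sketch below is illustrative only: `ToyVirtualCache` is a hypothetical, unbounded, word-granular write-back cache tagged purely by virtual address, and the page numbers and data values are invented for the example.

```python
PAGE_BYTES = 4096

class ToyVirtualCache:
    """Toy write-back cache indexed and tagged only by virtual address."""

    def __init__(self, page_table, memory):
        self.page_table = page_table   # virtual page number -> physical page number
        self.memory = memory           # physical address -> value
        self.lines = {}                # virtual address -> cached value

    def _physical(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_BYTES)
        return self.page_table[vpn] * PAGE_BYTES + offset

    def read(self, vaddr):
        if vaddr not in self.lines:    # miss: translate, then fetch from memory
            self.lines[vaddr] = self.memory.get(self._physical(vaddr), 0)
        return self.lines[vaddr]

    def write(self, vaddr, value):
        self.lines[vaddr] = value      # write-back: memory is not updated here

# Virtual pages 5 and 7 are synonyms: both map to physical page 9.
cache = ToyVirtualCache({5: 9, 7: 9}, {9 * PAGE_BYTES + 16: 111})
a = 5 * PAGE_BYTES + 16
b = 7 * PAGE_BYTES + 16

before = cache.read(b)         # loads the word under virtual address b
cache.write(a, 222)            # updates only the copy tagged with address a
after = cache.read(b)          # the copy under b was never updated
copies = len(cache.lines)      # two cached copies of one physical word
```

In this model the read through `b` keeps returning the old value after the write through `a`, which is exactly the inconsistency that a synonym restriction or a reverse-translation mechanism is meant to prevent.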
It should be noted that if the size of the cache is small enough, only bits from the "byte-on-page" field are needed to identify a set. Since this field does not change during translation, there is no need for reverse translation. Increasing the associativity of the cache has the effect of reducing the number of sets for the same amount of cache. This explains why it might be advantageous to build caches with set-associativity of 16 or more even though studies have shown that 8-way set-associativity well approximates full associativity [Smit78].
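The condition in the paragraph above reduces to simple arithmetic: the bytes covered by one way of the cache must not exceed the page size. The helper names below are hypothetical, and sizes are assumed to be powers of two.

```python
def min_associativity(cache_bytes, page_bytes):
    """Smallest associativity at which the set index lies within the byte-on-page field."""
    return max(1, cache_bytes // page_bytes)

def page_number_bits_in_index(cache_bytes, ways, page_bytes):
    """Number of page-number bits needed to select a set (0 means none are needed)."""
    way_bytes = cache_bytes // ways        # bytes addressed by index plus block offset
    if way_bytes <= page_bytes:
        return 0
    return (way_bytes // page_bytes).bit_length() - 1   # log2 for powers of two

KB = 1024
ways_64k = min_associativity(64 * KB, 4 * KB)                 # 64K byte cache, 4K byte pages
ways_128k = min_associativity(128 * KB, 4 * KB)               # 128K byte cache, 4K byte pages
bits_direct = page_number_bits_in_index(128 * KB, 1, 4 * KB)  # direct-mapped 128K byte cache
```

With 4K byte pages, a 64K byte cache would need 16-way associativity to avoid using page-number bits, which matches the figure of "16 or more" mentioned above.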
Even if the cache must be addressed with bits from the page-number field, reverse translation could still be avoided by requiring that physical page numbers match their corresponding virtual page numbers in enough bits. For example, if only one additional bit beyond the byte-on-page field were required to address the cache, we could arbitrarily require that even-numbered virtual pages be placed only in even-numbered physical page frames, and similarly that odd pages be mapped only to odd page frames. In general, a virtual page would be restricted to reside in a particular set of physical page frames. Figure A.2 shows a more realistic example of this set-associative page placement. The cache may now be referenced from either the virtual or physical "side" with no need for a reverse mapping. However, if programs reference only certain sets heavily, unnecessary paging may occur even though there may be available frames in other sets.
Figure A.2. If virtual and physical addresses are constrained to match on enough of the low-order bits, it becomes possible to map either address into the same cache locations. However, this restricts a physical page frame to hold only virtual pages whose addresses coincide on these low-order bits. For example, a 64K byte (16 address bits) direct-mapped cache requires that the low-order 16 bits of the physical and virtual address match. Assuming a 4MB main memory (22 address bits) with 4K byte pages, virtual pages can be placed into one of 16 sets (selected by address<15:12>) of 64 physical pages (selected by address<21:16>). This means that there are only 64 possible page frames for each virtual page, rather than 1024 (the total number of physical page frames).
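The numbers in this example follow directly from the cache, memory, and page sizes. The short sketch below reproduces them; the function names are hypothetical, a direct-mapped cache is assumed, and all sizes are assumed to be powers of two.

```python
def page_placement(cache_bytes, memory_bytes, page_bytes):
    """Return (placement sets, frames per set, total frames) for a direct-mapped cache."""
    total_frames = memory_bytes // page_bytes
    placement_sets = max(1, cache_bytes // page_bytes)   # page-number bits used by the index
    return placement_sets, total_frames // placement_sets, total_frames

def allowed_frames(virtual_page, cache_bytes, memory_bytes, page_bytes):
    """Physical frames whose low-order page-number bits match those of the virtual page."""
    sets, per_set, _ = page_placement(cache_bytes, memory_bytes, page_bytes)
    low_bits = virtual_page % sets
    return [low_bits + k * sets for k in range(per_set)]

KB, MB = 1024, 1024 * 1024
sets, per_set, total = page_placement(64 * KB, 4 * MB, 4 * KB)
frames = allowed_frames(0x35, 64 * KB, 4 * MB, 4 * KB)   # arbitrary example virtual page
```

Here `sets`, `per_set`, and `total` correspond to the 16 sets, 64 frames per set, and 1024 total frames of the example, and every frame returned for virtual page 0x35 shares its low-order four page-number bits.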
The need for reverse translations can be avoided more simply by placing both virtual and physical addresses on the bus. The caches are addressed from the system bus side by virtual addresses; physical addresses are used to access main memory. This requires either a wider address bus or time multiplexing the addresses.
In the preceding three options, the translation must be done on every cache reference. For high performance, this requires that the translation mechanism be fast. A good deal of hardware and design effort must be spent to keep the mapping time down to one RAM access.
The fourth alternative translates only on a cache miss (option (4)). This is attractive since misses constitute a small percentage of all references. Thus, a slower mechanism built with less hardware can achieve the same effective access time as the previous, more costly mechanisms. The need for reverse translations is still present and requires the same mechanisms as discussed for option (3). The Xerox Dragon [McCr84], a VLSI-based multiprocessor system with an architecture similar to that described here, does its address translations only if a cache misses. Reverse translation is handled by storing both physical and virtual page numbers for each block in the tag memory, and by providing a fully associative lookup from the system bus side. Even on a miss, a translation is not necessarily required: if the referenced word is on the same virtual page as some other word already in the cache, translation is avoided by using the physical page number stored with that word's block. This is an elegant solution, but requires a fully associative lookup and roughly twice the memory for cache tags. Both of these costs severely limit the amount of cache that can be provided on a single VLSI chip.
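The trade-off between the placement options can be illustrated with a simple cost model. The sketch below is a rough approximation under stated assumptions: the 1-cycle cache and 13-cycle memory figures are those used elsewhere in this paper, but the 2% miss rate and the 1-cycle and 3-cycle translation times are hypothetical values chosen only for illustration.

```python
def cycles_serial(cache, translate, miss_rate, memory):
    """Options (1) and (2): translation in series with every cache access."""
    return translate + cache + miss_rate * memory

def cycles_parallel(cache, translate, miss_rate, memory):
    """Option (3): translation overlapped with the cache access."""
    return max(translate, cache) + miss_rate * memory

def cycles_on_miss(cache, translate, miss_rate, memory):
    """Option (4): translation performed only when the cache misses."""
    return cache + miss_rate * (translate + memory)

CACHE, MEMORY = 1, 13        # cycles, as in the cycle counts used for the tables
MISS_RATE = 0.02             # hypothetical cache miss rate
FAST, SLOW = 1, 3            # hypothetical translation times in cycles

serial_fast = cycles_serial(CACHE, FAST, MISS_RATE, MEMORY)
parallel_fast = cycles_parallel(CACHE, FAST, MISS_RATE, MEMORY)
on_miss_slow = cycles_on_miss(CACHE, SLOW, MISS_RATE, MEMORY)
```

Under these made-up numbers, a three-cycle translator used only on misses comes within a few hundredths of a cycle per reference of a one-cycle translator operating in parallel with the cache, and is well ahead of the serial arrangement.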
The final option is to do the address translation in the main memory system (option (5)); the system bus would then use only virtual addresses. This has the advantage of centralizing the mapping hardware. The contention for this hardware would be no worse than that of main memory itself. There are, however, several disadvantages. First, the bus must be wider than a strictly physical bus to accommodate the larger virtual address. Second, latency to memory must increase to allow for translation. For protocols in which the bus is "held," the bus will be busy for a longer period of time per reference. Since the bus is a critical resource in a tightly coupled multiprocessor, this is likely to have a serious effect on performance. It might be possible to design the translation mechanism to work largely in parallel with the memory RAM access. However, we lose the advantage of being able to do it in the "leisurely" fashion discussed in option (4). By translating at the memory, reverse translations are not needed, but synonyms could still present a problem. To simplify the cache consistency protocol, writable synonyms would have to be disallowed. This approach also requires the design of custom memory and I/O controllers.
For high performance in the SPUR multiprocessor, we chose to provide a virtual cache, and to do translation only on misses. This rules out the first three options. Prototyping constraints prevented the redesign of memory and I/O, eliminating the last option. Translation is therefore performed at each processor after the cache. Both physical and virtual addresses are placed on the bus to eliminate the need for reverse translation, and synonyms are disallowed by requiring that two address spaces sharing common data use the same virtual addresses for that region.
This graph shows the advantage of larger page size in translation. Misses in the VAX translation buffer would drop from about 2.5 in 100 instructions down to 0.5 in 100 if the page size were increased from 512 to 4K bytes. The VAXIMA and MVS traces are used in the study of the SPUR in-cache translation, and show much higher miss rates.
These simulations were performed with one half of the VAX translation buffer. The curve for MVS was generated without flushing to approximate system half performance. The other curves are the result of flushing the process half every 20,000 references.
The VAXIMA and MVS traces are used in the study of the SPUR in-cache translation, and show much higher miss rates. These results further strengthen the assumption that the VAXIMA trace produces high enough miss rates to serve as a conservative measure of translation buffer performance.
Appendix C Sensitivity of In-Cache Translation to Cache Parameters
In the following studies, the parameters of the SPUR cache were used as the nominal values: 128K byte unified cache, direct-mapped, 32 byte block size, with a 4K byte page. Cache size, associativity, block size, and page size were then varied to examine the effect on the SPUR in-cache translation method. In the following graphs, the increase in percentage miss rate is plotted on a log scale on the vertical axis. This metric reflects both the collisions due to the presence of PTEs in the cache, and also the misses on the PTEs themselves. Figure C.1 shows that even at cache sizes that are small relative to the 128K byte SPUR cache, the translation performs well when compared with commercial translation buffers. For example, LISZT in a 16K byte cache incurs an increase in miss rate from 2.7% as a normal cache to 3.2% with in-cache translation: an additional 0.5% misses. This increase in miss rate roughly halves when these smaller sizes are increased by a factor of two. The "knee" of the curve at 128K bytes is typical of observations that led to building the cache at this size.
In Figure C.2, both LISZT and MVS show considerable sensitivity to associativity. The two context-switching traces, CS100K and CS20K, display what appears at first to be odd behavior. At 2-way set-associativity, they do better than direct-mapped, but as the cache approaches full associativity, it begins to cost more to do translation in-cache. This is due to the rate at which one reference stream collides with entries left over from the other stream's previous run. In set-associative caches, the entries are not all replaced until every entry in each set has been indexed to. However, in a fully associative cache, all n entries will be replaced as soon as n new blocks have been brought in. The graph argues strongly for increasing the associativity to two-way. However, by building the cache as direct-mapped, circuit area that would have been used for multiplexing hardware was traded for more tag storage.
Figure C.3 shows much less sensitivity to variation than the preceding graphs. For MVS, the additional misses increase as the block size increases. On closer inspection, this is most dependent on the occurrence of PTE misses. The cases of PTEs colliding with instructions and data reach a slight minimum at 32 byte blocks. For the other four traces, both of these values steadily decline as block size increases. This suggests that MVS references adjacent pages less frequently.
Figure C.4 also shows less sensitivity to variation than cache size or associativity. In general, as the page size increases, the translation is more effective. This is almost entirely accounted for by the number of PTE misses declining. The number of collisions of PTEs with instructions and data also drops, but to much less of an extent. For some reason, a page size of 2K bytes causes a higher rate of these collisions in the VAXIMA trace.
