Abstract
Introduction
Modern CPUs often use large physically-indexed caches that are direct-mapped or have low associativity. Direct-mapped caches are preferred because they are faster than set-associative caches and can provide data earlier in the pipeline [1]. Physically-indexed caches are popular because they avoid complex alias problems and greatly simplify the design of snooping hardware for multi-processor systems [2]. However, such caches do not interact well with virtual memory systems. The system may assign physical addresses to two actively referenced pages such that both pages map to the same page-sized region of the cache, causing excessive conflict misses.
Page Coloring
Page coloring and other related schemes [3, 4, 5, 6, 7, 8] have been proposed to change the page placement in large physically-indexed caches in order to reduce conflict misses. When mapping a virtual memory page to a physical memory page, conventional systems simply allocate the top page from the free page list. The above techniques select a page according to some criterion such that the page will cause fewer conflicts in the cache. Sometimes an active page has to be replaced in order to satisfy the requirement [8].
Drawbacks of Current Page Coloring Solutions
Previous studies have shown that page coloring systems implemented at the operating system level are efficient. Although these systems need very few modifications to the operating system, they have some noticeable drawbacks:
1. Page coloring interferes with the page placement algorithms. For example, many page coloring schemes sort free pages into different free page lists according to their colors. When a program requests a new page, only the free page list of the desired color is searched. This effectively makes the mapping between virtual addresses and physical addresses set-associative instead of fully-associative [6]. If the list is empty, the system either replaces an active page with a page of the desired color, which increases swapping traffic, or uses an imperfect coloring algorithm, which may increase cache conflict misses.
2. Page coloring cannot use any dynamic referencing information to improve page placements, since most memory references are filtered by the cache and are not seen by the MMU (Memory Management Unit). For example, the OS cannot know exactly which page is accessed most recently if the MMU does not see all the memory requests.
3. Re-coloring pages at run-time by copying them to new physical pages results in run-time overhead.
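The first drawback can be illustrated with a minimal sketch of per-color free page lists. The number of colors, the rule that a physical page's color is its page number modulo the number of colors, and the names `build_free_lists` and `allocate` are assumptions made for illustration; they are not taken from any particular operating system.

```python
from collections import deque

NUM_COLORS = 4  # assumed number of page colors


def build_free_lists(free_pages):
    """Sort free physical page numbers into one list per color."""
    lists = [deque() for _ in range(NUM_COLORS)]
    for ppn in free_pages:
        lists[ppn % NUM_COLORS].append(ppn)
    return lists


def allocate(lists, wanted_color):
    """Return (ppn, exact_match).

    Only the list of the wanted color is searched first. If it is empty,
    this sketch falls back to another color (imperfect coloring) instead
    of evicting an active page.
    """
    if lists[wanted_color]:
        return lists[wanted_color].popleft(), True
    for offset in range(1, NUM_COLORS):
        color = (wanted_color + offset) % NUM_COLORS
        if lists[color]:
            return lists[color].popleft(), False
    return None, False
```

With free pages 0, 1 and 5, two requests for color 1 are satisfied exactly, while a third request for color 1 receives page 0 even though its color does not match.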
The remainder of the paper is organized as follows. Section 2 proposes a new solution called color-indexed, physically-tagged caches that decouple the memory addresses from cache addresses to minimize conflict misses. Section 3 explains the design and algorithms in more detail. Section 4 discusses issues related to multi-processor systems. Section 5 describes our experimental methodology and metrics. Section 6 presents simulation results of cache performance for different algorithms. Section 7 discusses related work. We summarize the paper in Section 8.
Decoupling Memory Addresses from Cache Addresses
The problem with a traditional physically-indexed cache is that the physical memory addresses are also used to index the cache, as shown in Figure 1. An improperly placed page (with an improper physical address) will end up in the wrong place in the cache, causing excessive conflicts with other cached pages.
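To make the conflict concrete, the following minimal sketch computes which page-sized region of a direct-mapped, physically-indexed cache a physical address falls into. The 64 KB cache size, the 4 KB page size, and the function name `cache_page_of` are illustrative assumptions.

```python
CACHE_SIZE = 64 * 1024                      # assumed cache size in bytes
PAGE_SIZE = 4 * 1024                        # assumed page size in bytes
NUM_CACHE_PAGES = CACHE_SIZE // PAGE_SIZE   # 16 page-sized regions (colors)


def cache_page_of(phys_addr: int) -> int:
    """Cache page selected when the cache is indexed by physical address bits."""
    phys_page_number = phys_addr // PAGE_SIZE
    return phys_page_number % NUM_CACHE_PAGES


# Physical page numbers 3 and 19 differ by 16, so they share a cache page.
page_a = 0x00003000
page_b = 0x00013000
same_cache_page = cache_page_of(page_a) == cache_page_of(page_b)
```

If both pages are actively referenced, every line of one page competes with the line at the same offset in the other page.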
If the OS tries to move data around so as to change the data locations in the cache, it will be in conflict with the VM (Virtual Memory) page placement decisions. Based on this observation, we propose to reduce the conflict misses by decoupling the memory addresses from cache addresses. By using different addresses for the cache and the memory system, the placement of a page in the cache is independent of its location in the physical memory. As a result, we can optimize the cache data placement to reduce the conflicts, and optimize the paging system to reduce swapping traffic. Optimizing one does not adversely affect the other.
Almost all modern computer systems that support virtual memory use TLBs (Translation-Lookaside Buffers) to reduce the overhead of virtual-to-physical address translations. A TLB is a small cache (normally 32-256 entries) that holds the recent virtual-to-physical translations. Since the TLB is small, its access time is very short. When the system needs to translate a virtual address to the corresponding physical address, the TLB is searched first. If a match is found (a TLB hit), the physical address is obtained from the TLB directly. Otherwise (a TLB miss), the page table is accessed and the TLB is updated with the new mapping.
In a physically-indexed cache, every memory access is translated by the TLB. The TLB is therefore an ideal location to implement our design. Figure 2 shows the basic design of our scheme. In addition to the regular fields such as the virtual page number, the physical page number, etc., each TLB entry has a new narrow (several bits) field called Cache Page Number, which is used together with the offset field to form a Cache Address. Thus the TLB translates the virtual address to two addresses: a physical memory address for accessing the main memory, and a cache address for indexing the cache. Both these translations are done in parallel. Since these two addresses are independent, the placement of data in the physical memory will not affect its location in the cache. OS/hardware can choose optimal addresses for caches to reduce conflict misses.
Note that the cache address is only used to index the cache to locate the appropriate cache set, so it does not need to be a complete memory address. The cache still needs the physical memory address for the tag comparison purpose. This design can thus be viewed as a color-indexed, physically-tagged cache, where the color is the cache page number assigned by the hardware and/or software.
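The double translation described above can be sketched as follows. This is a software model under stated assumptions, not the hardware design of Figure 2: the TLB is modeled as a dictionary keyed by virtual page number, the cache is direct-mapped with 16-byte lines, and the names `TlbEntry`, `translate` and `set_index_and_tag` are hypothetical.

```python
from dataclasses import dataclass

PAGE_SIZE = 4 * 1024                 # assumed page size
LINE_SIZE = 16                       # assumed cache line size
CACHE_SIZE = 64 * 1024               # assumed cache size
NUM_SETS = CACHE_SIZE // LINE_SIZE   # direct-mapped: one line per set


@dataclass
class TlbEntry:
    vpn: int          # virtual page number
    ppn: int          # physical page number
    cache_page: int   # the new narrow Cache Page Number field (the color)


def translate(tlb, vaddr):
    """Return (physical address, cache address) for a virtual address.

    A KeyError stands in for a TLB miss in this sketch.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = tlb[vpn]
    phys_addr = entry.ppn * PAGE_SIZE + offset
    cache_addr = entry.cache_page * PAGE_SIZE + offset
    return phys_addr, cache_addr


def set_index_and_tag(phys_addr, cache_addr):
    """The cache address selects the set; the physical page number is the tag."""
    set_index = (cache_addr // LINE_SIZE) % NUM_SETS
    tag = phys_addr // PAGE_SIZE
    return set_index, tag
```

Because the color no longer comes from the physical address, the tag in this sketch is the whole physical page number; the physical address alone does not reveal which set holds the line.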
The "cache page" field is very narrow. For example, if the cache is 64 KB and the virtual memory page size is 4 KB, then the cache has 16 pages of data, requiring only 4 bits for the "cache page" field. Consequently, the hardware overhead of this scheme is minimal. More importantly, since our design does not require more TLB entries, the TLB access time is not affected. Compared to traditional page coloring techniques, the proposed architecture has the following significant advantages:
1. Since memory addresses and cache addresses are decoupled, they will not interfere with each other. The hardware/software can assign an arbitrary cache page address to a physical page to minimize the conflict misses, regardless of its physical memory address. As a result, we can easily implement a real dynamic perfect page coloring system. Since the placement of a page in the cache is independent of its location in the physical memory, the OS is free to use any page placement algorithm to optimize swapping traffic.
2. The TLB sees all memory requests, including those that are hits in the cache. For example, the TLB knows exactly which page is accessed most recently. Such dynamic information can be exploited to further reduce the conflict misses.
3. The scheme requires very little hardware overhead. Only several extra bits for each TLB entry are required for the simplest implementation. Some additional hardware can be added to exploit the dynamic reference information and improve performance. More importantly, the TLB access time will not be affected.
4. Our design carefully allocates cache pages only on page faults in order to minimize the conflict misses. This approach has very low overhead. There is no run-time overhead of copying pages to different locations, which is a drawback of other software-based approaches.
Our extended TLB records mapping information for both caches and memory. It is thus important to save both the physical page numbers and the cache page numbers in the page table. If we do not save the cache page number information, when a TLB entry is replaced on a TLB miss, the cache mapping information in the entry will be lost.
If we use a dynamic algorithm in which the cache color of a page is not determined solely by its virtual page number, then the next time the page enters the TLB, a new cache page number may be allocated. Meanwhile, data from the page may still be cached at the location given by the previous mapping (which has been discarded), causing a consistency problem.
Many systems allow certain kernel accesses to bypass the TLB and access physical memory directly [9, 10]. Such addresses are called KSEG addresses. We can easily disable TLB bypassing [11] or use the physical address as the cache address for cached KSEG addresses.
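A minimal sketch of the remedy is to keep the cache page number in the page table next to the physical page number, so that a TLB refill restores the old color instead of choosing a new one. The dictionary-based page table and the names `map_page` and `tlb_refill` are assumptions made for illustration.

```python
page_table = {}  # vpn -> (ppn, cache_page); both fields survive TLB evictions


def map_page(vpn, ppn, choose_color):
    """On a page fault: choose a color once and store it with the mapping."""
    page_table[vpn] = (ppn, choose_color(vpn))


def tlb_refill(tlb, vpn):
    """On a TLB miss: reload both fields from the page table.

    The color is not chosen again here, so lines of this page that are
    still in the cache remain reachable through the same cache page.
    """
    tlb[vpn] = page_table[vpn]
    return tlb[vpn]
```

Even if `choose_color` is a dynamic policy that would return a different color on a later call, the page keeps the color it received at its page fault.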
Placing Pages in the Cache

Raw Page Coloring
This scheme tries to emulate the traditional page coloring scheme in which the physical page and virtual page always have the same color. This algorithm tries to exploit the contiguous address spaces of programs. However, it is not possible to implement such a perfect algorithm in a traditional software-only system without incurring increased page faults. Since the cache addresses and physical addresses are decoupled in our new architecture, it is very easy to implement a perfect raw page coloring scheme by simply assigning the low-order bits of the virtual page number to the cache page number field. There will be no consistency problems.
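The assignment of low-order bits can be sketched in a few lines. The function name `raw_color` and the requirement that the number of colors be a power of two are assumptions of this sketch.

```python
def raw_color(vpn: int, num_colors: int) -> int:
    """Cache page number taken from the low-order bits of the virtual page number."""
    if num_colors <= 0 or num_colors & (num_colors - 1):
        raise ValueError("num_colors must be a power of two")
    return vpn & (num_colors - 1)
```

With 16 colors, any 16 consecutive virtual pages receive 16 distinct colors, so a contiguous region no larger than the cache does not conflict with itself.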
LRU Algorithms
The new design is implemented in the TLB, which sees all memory references including those that hit in the cache. Such dynamic reference information can be used to improve the performance.
One such approach is the LRU (Least-Recently-Used) algorithm. On a page fault, the algorithm tries to allocate the least-recently-used color as the color of the new page. The rationale is that such a color is not actively used by other pages. Therefore the color should be assigned to the new page to minimize the conflicts.
In most systems, LRU algorithms can be implemented with no extra hardware. Most systems allow the OS to read the content of the TLB. Since TLBs are often fully-associative, the LRU entry can be obtained or inferred by the OS. If the TLB content cannot be read, or if the TLB is not fully-associative, then we can maintain a time-stamp for each color. The corresponding time-stamp is updated on each TLB access. On a page fault, the list of time-stamps is searched, and the color with the oldest time-stamp is the LRU color. Since the number of colors is small, and the search takes place only during page faults, the overhead is small.
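The time-stamp variant can be sketched as below. In this sketch a time-stamp is a logical clock value that grows with every access, so the LRU color is the one holding the smallest stored value; that convention, and the class name `LruColorTracker`, are assumptions rather than details given in the text.

```python
class LruColorTracker:
    def __init__(self, num_colors):
        self.clock = 0                      # logical time, advanced per access
        self.last_used = [0] * num_colors   # one time-stamp per color

    def touch(self, color):
        """Called on each TLB access to a page of the given color."""
        self.clock += 1
        self.last_used[color] = self.clock

    def lru_color(self):
        """Called on a page fault: linear search for the oldest time-stamp."""
        return min(range(len(self.last_used)), key=self.last_used.__getitem__)
```

The search is linear in the number of colors, which is small (16 for a 64 KB cache with 4 KB pages).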
LFU Algorithms
Another algorithm that benefits from the dynamic reference information is LFU (Least Frequently Used). The TLB maintains a list of possible colors, each with a counter attached. On a memory reference, the counter of the corresponding page color is incremented, in parallel with the TLB translation. When a re-mapping occurs (on a page fault), the operating system searches the list to select the color with the smallest counter value as the target color. Since the list is short (a cache of 128KB requires a list of 16 entries, if the page size is 8KB), the overhead is small.
In order to keep the counters reflecting recent page access behavior and to limit the size of the counters, the system resets the counters periodically. If the reset frequency is too high, LFU cannot gather enough information and some of the counters will remain zero all the time. If the reset frequency is too low, the reference information maintained by the counters may be out-dated. We tested different reset rates in the simulation.
Another important concept is the cursor. In LFU, when looking for a color with the smallest counter value, the system may find that many colors have the same least counter value. A trivial implementation may select the first color with the least counter value. However, in such cases many pages will be allocated with colors at the beginning of the color list, while the colors at the end of the list are under-utilized. We found that such a strategy causes significant overhead for some applications.
The remedy is to use a "Cursor", which is a pointer to the currently-selected color. A search always starts from the Cursor. The Cursor effectively adds a round-robin feature to LFU.
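The LFU policy with periodic resets and a Cursor can be sketched as follows. Several details are assumptions of this sketch: the reset period is counted in memory references, ties are broken in favor of the first minimum found from the Cursor onward, and the Cursor is advanced to the color after the one just selected so that equal counters produce a round-robin order. The class name `LfuColorAllocator` is hypothetical.

```python
class LfuColorAllocator:
    def __init__(self, num_colors, reset_interval):
        self.counters = [0] * num_colors
        self.cursor = 0                    # where the next search starts
        self.reset_interval = reset_interval
        self.refs_since_reset = 0

    def reference(self, color):
        """Called on each memory reference to a page of the given color."""
        self.counters[color] += 1
        self.refs_since_reset += 1
        if self.refs_since_reset >= self.reset_interval:
            # Periodic reset keeps counters small and biased to recent use.
            self.counters = [0] * len(self.counters)
            self.refs_since_reset = 0

    def allocate(self):
        """Called on a page fault: pick the least frequently used color."""
        n = len(self.counters)
        best = self.cursor
        for step in range(1, n):
            candidate = (self.cursor + step) % n
            if self.counters[candidate] < self.counters[best]:
                best = candidate
        self.cursor = (best + 1) % n
        return best
```

Without the Cursor, a run of page faults that finds all counters equal would assign the first color every time; with it, the same run spreads the pages over all colors.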
Multi-processor Support
Cache-coherent shared memory multi-processor systems normally use physically-indexed caches because it is much easier to implement snooping protocols based on physical addresses [2]. When the snoop hardware sees a memory transaction on the address bus (which uses physical addresses), it uses the memory address of the transaction to check the local cache. However, since our scheme changes the locations of data in the cache (data are not indexed by their physical addresses anymore), the snoop hardware will not be able to locate the data correctly.
The problem is very similar to that faced by virtually-indexed caches. The solutions are also similar. For example, we can let the lowest-level cache (say level-3 caches) be physically-indexed and contain back-pointers to higher-level caches (which use our design for better performance), similar to the method of Virtual-Real Caches [12] .
Another method is to let the system send the cache page address together with the physical address on the address bus. This requires several extra bits (the cache page address is very narrow) on the address bus. A similar method is used in virtually-indexed caches [13]. When a cache sees a request on the bus, it can use the cache page address and the physical address together to correctly locate the data in the cache. The OS needs to make sure that all TLBs are consistent: the same physical pages have to map to the same cache pages. This does not require additional work beyond the original TLB shootdown procedure [14].
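The second method can be sketched with a small model of a direct-mapped, color-indexed cache that is snooped with both the cache page address and the physical address. The sizes, the class name `ColorIndexedCache`, and its methods are assumptions made for illustration.

```python
PAGE_SIZE = 4 * 1024                     # assumed page size
LINE_SIZE = 16                           # assumed cache line size
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE


class ColorIndexedCache:
    def __init__(self, num_cache_pages):
        self.num_cache_pages = num_cache_pages
        self.tags = {}   # set index -> physical page number stored in that line

    def _set_index(self, cache_page, phys_addr):
        # The color picks the cache page; the page offset picks the line in it.
        line_in_page = (phys_addr % PAGE_SIZE) // LINE_SIZE
        return cache_page * LINES_PER_PAGE + line_in_page

    def fill(self, cache_page, phys_addr):
        self.tags[self._set_index(cache_page, phys_addr)] = phys_addr // PAGE_SIZE

    def snoop_hit(self, cache_page, phys_addr):
        """Bus request carries both the cache page and the physical address."""
        idx = self._set_index(cache_page, phys_addr)
        return self.tags.get(idx) == phys_addr // PAGE_SIZE
```

A snoop that guessed the cache page from the physical page number alone would probe the wrong set whenever the assigned color differs from the physical color, which is why the color travels with the request.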
Experimental Methodology
We conducted trace-driven simulation experiments to evaluate the performance of the proposed architectures. Our main performance metric is cache miss rate. We simulated physically-indexed, level-1 data caches. The cache sizes are varied from 64 KB to 256 KB. We simulated both direct-mapped and 2-way set-associative caches with our schemes. The block size is 16 bytes, unless otherwise noted.
We used SPECint2000 traces for simulation. These traces were collected with the ATOM tool running on Compaq TruUnix. They contain virtual addresses only. In order to evaluate normal physically-indexed caches, we simulated a paging system with a page size of 4 KB. The physical memory size was 500 MB. A clock page replacement algorithm was used. Each trace contains 100 million data references. Since register-to-register instructions do not generate data memory accesses, the actual numbers of instructions executed are much larger. We skipped the initial 30 million entries of each trace (to skip the initialization code) and simulated the remaining 70 million entries. We found that the simulation results stabilized after 20 million entries.
Table 1 shows the different simulation configurations. In order to compare our scheme with traditional systems without page coloring, we used Phy-Direct, Phy-2-Way and Phy-Fully-Assoc, which are physically-indexed caches with different associativities. To compare our design with the traditional page-coloring algorithms, we used V=R, which is a perfect static algorithm that always assumes physical addresses equal virtual addresses. Note that this is an optimistic assumption that is impossible to achieve in a real system, especially in a multi-process environment.
Raw, LFU1, LFU2 and LRU are our decoupled designs with different color assignment algorithms. All these designs use a direct-mapped cache. 
Simulation Results
We present our simulation results in this section. In order to easily compare performance among the various configurations, we chose Phy-Direct (physically-indexed, direct-mapped caches) as the baseline system. We show the miss rates of the other configurations relative to the baseline system. Note that in the following figures, the shorter the bars, the better the performance. We do not include the results for Raw, as its performance is identical to that of V=R. However, in real systems we expect Raw to outperform V=R, as the latter is not a feasible algorithm for traditional systems.
The Breakdown of Different Miss Types
Figure 3 shows the breakdown of miss types for direct-mapped caches of three different sizes. This figure clearly shows that conflict misses dominate the total misses. It is therefore essential to reduce conflict misses for high-performance systems.
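The text does not state how misses were classified. One common way to obtain such a breakdown is the three-C model, sketched below as an assumption: a miss in the direct-mapped cache is compulsory if the block was never referenced before, a conflict miss if a fully-associative LRU cache of the same capacity would have hit, and a capacity miss otherwise. The function name `classify_misses` is hypothetical, and block addresses are assumed to be line-granular already.

```python
from collections import OrderedDict


def classify_misses(block_addrs, num_lines):
    direct = {}             # set index -> block address (direct-mapped cache)
    fully = OrderedDict()   # fully-associative LRU cache of equal capacity
    seen = set()            # every block referenced so far
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_addrs:
        # Update the fully-associative reference cache first.
        fa_hit = block in fully
        if fa_hit:
            fully.move_to_end(block)
        else:
            fully[block] = True
            if len(fully) > num_lines:
                fully.popitem(last=False)   # evict the LRU block
        # Then check the direct-mapped cache and classify any miss.
        idx = block % num_lines
        if direct.get(idx) != block:
            if block not in seen:
                counts["compulsory"] += 1
            elif fa_hit:
                counts["conflict"] += 1
            else:
                counts["capacity"] += 1
            direct[idx] = block
        seen.add(block)
    return counts
```

For example, alternating between blocks 0 and 4 in a 4-line cache gives two compulsory misses followed only by conflict misses, since both blocks fit in the cache but share one set.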
Performance Under SPECint2000 Workload
Figure 4 shows the relative performance of different systems for each individual SPECint2000 application. The overall performance (the arithmetic average) of different schemes is shown in Figure 5. It is clear that overall, the traditional page coloring algorithm (V=R) performs better than the baseline system Phy-Direct. However, here we assume an unrealizable perfect coloring algorithm. Real systems may not perform as well (or may suffer from high page fault rates). The Raw scheme, which is one of our designs, will have performance identical to that of V=R but the former is realizable (actually it has the simplest implementation).
When the cache size is small (64 KB), the cache has only a small number of pages, which limits the performance of our new schemes. As a result, the four decoupled configurations perform better than Phy-Direct and V=R but worse than Phy-2-Way. However, this does not imply Phy-2-Way has better performance, because it has a longer access latency than the direct-mapped caches used by the decoupled systems.
When the cache size increases, the performance of decoupled schemes improves significantly because more cache pages are available to map. As a result, the new schemes (especially LFU1 and LFU2) perform much better than the traditional page-coloring algorithm and close to Phy-2-Way. At the same time, the decoupled systems still have lower access latency than Phy-2-Way.
Relative Performance of Different Cache Sizes
In Figure 6 we compare the relative performance of different schemes under different cache sizes, using the 64KB Phy-Direct as the baseline system. The figure clearly shows that with our new scheme, a direct-mapped cache can achieve much lower miss ratios than a cache which is twice as large. Since modern CPUs devote up to 80% of the chip transistors to caches [15], our technique can be used to significantly reduce the cache sizes without sacrificing performance. While a two-way set-associative cache can have a similar effect on reducing miss rates, it may not necessarily have higher performance because of the increased cache access time. Our schemes, on the other hand, do not affect cache access times.
Impacts of Different Cache Line Sizes
The results presented so far are based on a small cache line size of 16B. To study the impact of different cache line sizes, we studied the system with line sizes of 32B and 64B. Figure 7 shows the results for individual SPECint2000 programs and Figure 8 shows the average results. Set-associative caches and our schemes can only reduce conflict misses. As the line size increases, the conflict miss ratio increases because fewer cache lines are available. As a result, the effectiveness of 2-way set-associative caches, as well as our schemes, increases. In fact, when the cache line size is 64B, two of our schemes, LRU and LFU1, show much better performance than Phy-2-Way.
Using Decoupled Schemes with Set-Associative Caches
Our decoupled schemes are primarily designed to work with direct-mapped caches because of their low access latencies. In order to investigate whether our designs also work with set-associative caches, we simulated 2-way set-associative caches using our schemes and the traditional page-coloring technique. The results indicate that our schemes also work well for set-associative caches. A color-indexed, two-way set-associative cache performs much better than the baseline system (a physically-indexed two-way cache). In fact, our techniques perform very close to fully-associative caches most of the time.
Performance under Multi-processed Workloads
The results presented so far assume a single-process environment. We also studied the proposed architecture under multi-process workloads that include context switches in order to obtain more realistic results. There is an infinite number of combinations of different programs. We randomly picked nine groups of combinations and simulated their cache performance. Each group includes two to four SPECint2000 programs as listed in Table 2. We assume that the system runs all programs in a group concurrently. A round-robin scheduling algorithm is used and a context switch occurs every 5 million instructions. Figure 10 shows the performance of individual groups and Figure 11 shows the average relative conflict misses. Our new designs show even more promising performance here. For example, LRU is able to remove 99% of conflict misses. The traditional page coloring algorithm V=R does not perform as well, because addresses from multiple processes interfere with the page coloring algorithms.
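A round-robin interleaving of per-process traces can be sketched with a generator. This is an illustration under assumptions: the quantum is counted in trace entries rather than executed instructions, a process that exhausts its trace simply drops out, and the name `interleave_round_robin` is hypothetical.

```python
def interleave_round_robin(traces, quantum):
    """Yield (process_id, address) pairs, switching process every `quantum` entries."""
    if quantum < 1:
        raise ValueError("quantum must be at least 1")
    iterators = [iter(trace) for trace in traces]
    active = list(range(len(traces)))
    while active:
        still_active = []
        for pid in active:
            emitted = 0
            for addr in iterators[pid]:
                yield pid, addr
                emitted += 1
                if emitted == quantum:
                    # Quantum used up: context switch, keep process runnable.
                    still_active.append(pid)
                    break
        active = still_active
```

The process identifier is yielded with each address so that a simulator can keep separate page tables per process while sharing one cache.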
Related Work
As discussed previously, page coloring and other related schemes [3, 4, 5, 6, 7, 8] have been proposed to reduce conflict misses in large physically-indexed caches. Page coloring algorithms fall into perfect and imperfect colorings [4]. Perfect coloring requires that the allocation paradigm hold without fail. Since it may require excess pages of specific colors, page faults may increase due to contention on a single color, which would not have existed with monochrome memory. Imperfect coloring mitigates the situation by selecting an available page of another color if no more pages of the requested color are available. However, the cache performance may suffer.
Page colorings can also be classified as static page coloring and dynamic page coloring [4, 16]. Static page coloring, or just page coloring [3], is based on Sites and Agarwal's comparison of the performance of virtually-indexed caches to physically-indexed caches [17]. They found that physically-indexed caches perform worse than corresponding virtually-indexed caches unless there are frequent context switches. A static strategy maps a virtual page to a physical page with the same color. Dynamic coloring algorithms typically use "Bin Hopping" or round-robin coloring. These algorithms map virtual pages to physical pages of successive page colors. Kessler et al. [3] also proposed two other dynamic implementations called "Best Bin" and "Hierarchy".
Bershad and Romer et al. proposed to monitor cache conflicts at run-time using special or standard hardware [18, 19] . Once a large number of conflicts is detected, the page that causes the conflicts can be recolored by copying the page to a new physical page with a different color.
In addition to the various page coloring techniques discussed so far, there have been many other attempts at reducing conflict misses in direct-mapped caches.
Miss Caches and Victim Caches [20] use a small, fully-associative cache in conjunction with a regular direct-mapped cache to remove conflict misses. Studies show that they are effective when the cache is relatively small. Their performance is limited for large caches and their hit rates are not as good as those of two-way set-associative caches [21]. Moreover, the schemes require two or more cycles to access a conflicting datum.
The Hash-Rehash Cache [22] and Pseudo-Associative Caches [21, 23] try to use a direct-mapped cache to emulate set-associative caches by looking up data in another location in the cache on a miss. However, such designs have not been adopted in practice so far because they complicate the pipeline design by introducing non-constant cache hit times [24].
Some researchers [25, 26, 22, 27] proposed the use of unconventional indexing schemes, such as hashing or XORing, to reduce conflict misses. Using a good hashing algorithm can randomize cache placement, leading to better conflict resistance [25]. Smith [28] found that the advantages of random indexing were not significant. However, a recent paper [25] shows that for certain applications and cache organizations, the advantages can be very large.
By using different hashing functions for different cache banks, a two-way skewed-associative cache [29] can achieve the same hit ratio as a four-way set-associative cache of the same size, while the cost of the former may be lower. Skewed-associative caches aim to improve the performance of set-associative caches, while our schemes can improve the performance of both direct-mapped and set-associative caches.
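As a simple illustration of XOR-based indexing, the sketch below folds the address bits just above the index into the index itself. This particular function and the name `xor_index` are assumptions chosen for brevity; they are not claimed to be the functions studied in the cited work.

```python
def conventional_index(block_addr: int, index_bits: int) -> int:
    """Index taken directly from the low-order bits of the block address."""
    return block_addr & ((1 << index_bits) - 1)


def xor_index(block_addr: int, index_bits: int) -> int:
    """Index formed by XORing the low-order bits with the next higher bits."""
    mask = (1 << index_bits) - 1
    return (block_addr ^ (block_addr >> index_bits)) & mask
```

Blocks 0 and 16 collide under conventional indexing with 4 index bits, but the XOR function places them in sets 0 and 1.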
Conclusions
Large physically indexed caches are popular in modern CPUs. However, they do not interact well with virtual memory systems. Since the physical memory addresses are also used to index the cache, an improperly placed page will end up in the wrong place in the cache, causing excessive conflicts with other cached pages.
Page coloring has been proposed to reduce the conflict misses by carefully placing pages in the physical memory. However, while page coloring works well for some applications, many factors limit its performance. Page coloring also limits the freedom of page placement system and may incur high swapping traffic.
In this paper, we propose a novel and simple architecture called color-indexed, physically tagged caches that can significantly reduce the conflict misses. With some simple modifications to the TLB, the new architecture decouples the addresses seen by the cache from the addresses seen by the main memory. Since the cache addresses do not depend on the physical memory addresses anymore, the system can freely place data in any cache page to minimize the conflict misses, without affecting the paging system. We have shown that the design can be used in both single-processor and multi-processor systems.
We studied four different configurations of the new architecture. Their main differences lie in the way they assign cache colors on page mapping. We performed extensive trace-driven simulation using the SPEC2000 benchmark suite to evaluate the proposed architecture. Our results show that the new design performs much better than traditional page coloring techniques. With the new scheme, a direct-mapped cache can achieve hit ratios very close to those of a two-way set-associative cache. In addition, the new design, based on direct-mapped caches, has shorter access latencies than set-associative caches. Results of multiprogramming workloads are even more promising.
Since a direct-mapped cache with our scheme consistently achieves better performance than a traditional cache twice as large, our design can be used to significantly reduce the cache size without sacrificing performance.
Our design can also be used to improve the performance of set-associative caches. For example, a two-way set-associative cache using our schemes performs very close to a fully-associative cache. 
