As processor performance continues to improve, more demands are being placed on the performance of the memory system. The caches employed in current processor designs are very similar to those described in early cache studies. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. It will be shown that by selectively allocating cache lines according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables.
INTRODUCTION
The average data access time is a measure of the time it takes to read a data item from memory. Since most programs must access data frequently, minimizing this time is essential to good overall performance. Unfortunately, the average data access time has been climbing as the disparity between main memory access times and processor clock speeds widens. This disparity is likely to continue to grow, since there is no indication that dynamic memory access times will decrease significantly in the near future.
In order to minimize the impact of slow main memory access times, several strategies have been used. Most machines now include a first-level cache, which is designed to reduce the average data access time by capturing the most frequently used data items. If necessary, a second-level cache can also be added to the system; since the second-level cache will presumably be smaller than the main memory, it can be built using faster and more expensive logic.
Another option is to interleave main memory, so that each word of a cache line does not have to experience the full latency of the main memory. Widening the bus between the primary cache and the main memory (or second-level cache) is also an option. Both of these approaches have the effect of increasing the bandwidth of the data flowing across the chip boundary. The effectiveness of these strategies depends on how easy it is for designers to increase the number of pins on a chip and/or increase the rate at which these pins are driven.
Data cache cycle times have been able to keep pace with the clock cycle of the machine, lockup-free cache designs have been used, more deeply pipelined cache designs are beginning to appear, and a number of schemes have been developed to handle writes so that they do not slow down the normal operation of the cache.
All these traditional approaches help decrease the average cache access time, but they do not fundamentally change the way in which the cache operates. The placement and replacement strategies are essentially the same as when caches first appeared. As multiple-issue processors continue to increase the number of instructions that can be issued each cycle, there will be a corresponding increase in the demands placed on the bandwidth to the data memory; the data cache in particular will be hard-pressed to service more than one data reference per cycle.
Since it is not clear that traditional methods of reducing the data cache miss rate and miss penalty will be sufficient, we believe that a somewhat different approach is warranted. In this paper, we examine the potential of reducing the average data access time by dynamically deciding whether to cache a particular data item based on the address of the load instruction generating the request. The techniques we will be describing are largely orthogonal to standard miss rate/miss penalty reduction techniques, and should work well in conjunction with improvements made on other fronts.
BACKGROUND
Caches are a very well-studied and well-understood tool used to reduce the average memory access time. In this section we will briefly summarize the aspects of cache behavior relevant to this study.
In a system with a cache, the average access time for a memory reference is a function of the cache hit time, the hit rate (and corresponding miss rate), and the miss penalty: average access time = hit time + (miss rate × miss penalty). This equation shows that in order to minimize the average access time, the hit rate should be maximized (thereby minimizing the miss rate) while simultaneously minimizing the miss penalty.
Reducing Cache Misses (Miss Rate)
Cache misses can be categorized into three types: Compulsory, Capacity, and Conflict. The cache design will determine the relative weight of each type of miss on the overall cache miss rate. Compulsory misses are those misses that are initially experienced when a cache is being filled (often called cold start misses), and are very difficult to eliminate. A capacity miss occurs when more active data items exist than the cache can encompass. These misses can be reduced in number by increasing the number of lines in the cache, by increasing the size of each line, or both. A conflict miss may occur in cache designs that restrict the placement of data items in the cache, when more items map to the same cache set than the associativity of the cache supports. In order to reduce the number of conflict misses, the design should relax the restrictions on the placement of a line in the cache. Conflict misses do not occur in a fully associative cache, but become the dominant miss category in the large direct mapped caches commonly used. Generally, increasing the associativity of the cache reduces the frequency of conflict misses.
The miss rate is best reduced by increasing the size and/or the associativity of the cache. Unfortunately, in the design of today's high-performance processors, it is difficult to substantially increase either of these parameters because the access time of the data cache must first and foremost match the clock cycle time of the processor. Several studies have investigated the relationship between cache access time, cache size, and cache associativity. (2, 3) These studies carefully parameterized a hardware model of the components of the cache (such as the data array, tag array, compare logic, and bus delays) and found, for example, that going from a direct mapped to a 2-way set associative cache substantially increases the access time of the cache. A similar conclusion holds when increasing the primary cache size much beyond 16 K bytes.
Reducing Miss Penalty
The miss penalty accounts for the time it takes to read a cache line from (or write a cache line to) the next level in the memory hierarchy. This penalty has been climbing as the disparity between main memory access times and processor clock speeds widens. Especially in processor designs with on-chip caches, access time to off-chip memory (measured in processor clock cycles) has increased dramatically.
A number of studies have proposed techniques (either compiler-based or hardware-based) that reduce the miss penalty of the cache by performing some type of data prefetch. (4-6) If data items are prefetched during idle data cache cycles, references to a prefetched item will find it already in the cache and thus will not cause a miss, and the associated miss penalty will not be experienced. An example of hardware-based prefetching is the work by Chen and Baer, (6) in which the authors propose keeping a history of the strides of data references and using that information to make predictions as to what should be prefetched. IBM uses a similar hardware approach, (7) in which they associate previous miss behavior with a load instruction and use that information to do prefetching.
A somewhat different hardware approach to reducing the miss penalty is put forth by Ref. 8. The author makes the observation that a cache that is too small, or whose associativity is too low, will lead to a substantial number of conflict (or capacity) misses, and that there is a good chance that the line selected for replacement will be needed again soon. Therefore, he proposes the use of a small, fully associative buffer sitting at the back end of the cache (called a victim cache) to buffer these replaced lines. By keeping these recently replaced lines in close proximity to the data cache, subsequent references to them will not experience the full main memory miss penalty.
Among the most intriguing software approaches to reducing the miss penalty is a study by Abraham et al., (9) in which they observe that a very small number of load instructions are responsible for a disproportionate percentage of cache misses. By using profiling techniques similar to those used to schedule code for VLIW machines, the compiler can accurately identify the data reference instructions that cause the highest data cache miss rates. By recognizing these data references and using special instructions to control the cache, the software can effectively prefetch the data they require and reduce the miss penalty.
DECIDING WHAT TO CACHE
Instead of concentrating on miss penalty, cache size, or cache associativity, we decided to look at the source of cache misses. One could potentially reduce the miss rate of the data cache by simply not caching those data references that lead to a high miss rate. As Abraham et al. (9) point out, a large percentage of the data misses are caused by a very small number of instructions. Instead of using this information to make prefetching decisions, we decided to look at the impact on the data cache miss rate if the data cache is smarter about what it decides to cache and does not allow these troublesome instructions to allocate space in the data cache. Such an approach has the potential to utilize the cache more effectively, because instructions that generate a large number of cache misses remove more heavily utilized data items from the cache. In addition, if we do not cache the data associated with high-miss-rate instructions, memory bandwidth requirements could be reduced, since these references would only request a single word from memory instead of an entire cache line.
Since the study by Abraham et al. (9) did not look at an extensive set of benchmark programs, we began by performing experiments similar to theirs in which we measured the miss rate associated with individual load and store instructions for a more extensive set of programs. Using the ATOM program tracing facilities and the SPEC92 suite of benchmarks, such statistics were relatively straightforward to gather. Each program in the SPEC92 suite was instrumented in order to track the data cache hit rate associated with each unique data address. We simulated a 32-byte line size and both 8 K-byte and 16 K-byte caches, configured as direct mapped, 2-way set associative, 4-way set associative, and direct mapped with a victim cache. Table I presents a detailed breakdown of each benchmark analyzed, the input that was used, the total number of load data references, and the hit rates for each of the cache configurations simulated. Our results are not surprising and match those from many other studies: as expected, a direct mapped cache generally performs the worst, while increasing the associativity improves the hit rate.
An examination of the table reveals that using a 16-line victim cache in conjunction with a direct mapped cache provides cache hit rates that generally lie somewhere between those of the 2-way and 4-way associative configurations. In fact, in several benchmarks a considerable improvement over 4-way associativity is demonstrated (e.g., a jump from 64% to 84% in su2cor). This indicates that the number of conflicting items in the direct mapped cache is small but can be clustered on a single cache line. By allowing the associativity of the victim cache to be directed at those contention spots, performance can be improved using less hardware. In order to better understand what is causing the cache misses, we looked at the reference pattern of each program in greater detail. Table II presents the cumulative percentage of data references and data cache misses caused by the most heavily executed load instructions in an 8 K-byte direct mapped cache. Each row of the table contains the information gleaned from a run of the given SPEC benchmark. The eight columns, each labeled with a percentage of total dynamic references or total data cache misses, have two sub-columns indicating the number of static instructions that account for that percentage and the fraction of the total number of instructions they represent.
If we look at the compress benchmark, for example, we see that 34 static instructions are responsible for 75% of all load references, and those 34 instructions account for 1.11% of all load instructions in the benchmark (there are 3058 load instructions in compress, and 34/3058 = 1.11%). Continuing across the table, we see that 54 (1.77%) of the load/store instructions account for 95% of the data references. The remaining 98.23% of load/store instructions generate only 5% of the data references. This demonstrates a well-known principle of program execution: a small portion of the program is responsible for much of the execution effort. It is an effect of the 90/10 locality rule, which states that a program spends approximately 90% of its execution time in only 10% of its code.
Given that a small number of instructions are responsible for the majority of data references, it is reasonable to expect that this same effect would be reflected in the distribution of cache misses. This is also shown in Table II. Overall, we find that not only does the 90/10 rule still hold, but the miss pattern is even more clustered than the overall reference pattern. For almost all benchmarks, less than 5% of the total number of load instructions are responsible for causing over 99% of all cache misses.
The data in Table II make it clear that, in general, a small number of load/store instructions have a disproportionately large effect on the cache miss rate when compared to the number of total data references they generate. This is not all that surprising if one considers program behavior. References to global variables and to local variables (including those that reference the procedure call stack) can account for the data references with a high hit rate, especially if one considers the looping behavior of programs. Examples of references that generate low hit rates include references to items in a linked list, or traversals of an array with a long stride.
ANALYSIS OF CACHING POTENTIAL
Given that a small number of load instructions are responsible for generating the majority of data cache misses, we decided to measure the cache hit rate and the corresponding memory bandwidth required if these troublesome load instructions were prohibited from allocating space in the data cache. In order to accomplish this, we examined the cache behavior of each load instruction and identified the ones with the lowest cache hit rate. These were marked C/NA (Cacheable/Non-Allocatable), which means that the data references generated by these load instructions will not invoke the allocation policy of the hardware cache management algorithm. It does not mean that the referenced data will never be in the cache: the data item might be in the cache if a different instruction, one that does allocate on a miss, references that address.
It is important to stress that we are deciding whether or not to allocate based on the instruction address, not the effective address of the data reference. Thus, a cache lookup for an item is unaffected by whether the instruction is marked C/NA; only the allocation on a miss is affected. We looked at both static (similar to Ref. 9) and dynamic approaches to identifying and marking these C/NA instructions.
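To make the distinction concrete, the following is a minimal sketch (in Python, with illustrative names; this is not the authors' simulator) of a direct mapped cache in which every reference performs a lookup, but a miss allocates a line only when the load's instruction address is not marked C/NA:

```python
# Minimal sketch of the C/NA allocation policy for a direct mapped cache.
# Sizes follow the 8 K-byte, 32-byte-line configuration used in the paper.
LINE_SIZE = 32
NUM_LINES = 8 * 1024 // LINE_SIZE        # 256 lines, direct mapped

cache_tags = [None] * NUM_LINES          # tag stored per line, None = invalid
cna_pcs = set()                          # PCs of loads marked Cacheable/Non-Allocatable

def access(pc, addr):
    """Return True on a hit; on a miss, allocate only if the load is not C/NA."""
    line_addr = addr // LINE_SIZE
    index = line_addr % NUM_LINES
    tag = line_addr // NUM_LINES
    if cache_tags[index] == tag:
        return True                      # hit, regardless of C/NA status
    if pc not in cna_pcs:
        cache_tags[index] = tag          # normal allocation on a miss
    # C/NA miss: the single word is fetched from memory, but no line is allocated
    return False
```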
Static Method
We began by modeling a simple strategy in which all load instructions that do not meet a threshold for cache hit rate are marked C/NA. We looked at several threshold values, balancing the desire to remove poorly performing loads against the conflicting desire to utilize the cache for as many references as possible. We finally settled on a threshold value of 75% (instructions that cause a miss more than 75% of the time are marked C/NA). This number was chosen for several reasons. A lower value proved too aggressive in removing load references from using the cache, and a higher value did not remove a sufficient number of load instructions to help performance. Furthermore, the 75% threshold also relates to the memory bandwidth requirements for a cache line replacement (32 bytes) versus a 64-bit load reference (8 bytes), and is the same value settled on by Ref. 9. Table III shows the change in cache hit rate and required memory bandwidth after the poorest performing instructions were marked C/NA. Column one contains the name of the benchmark program, and column two shows the range of instructions that were made C/NA (since the count of instructions varied depending on the cache configuration). Columns 3-7 show the change in hit rates (compared to the entries in Table I) for caches that are direct mapped and 2-way set associative. As can be seen in Table III, there was a uniform, slight decrease in the hit rate across all configurations.
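A sketch of this marking pass is given below. It assumes per-instruction hit and miss counts collected during the training run; the data layout and names are illustrative rather than the authors' tooling.

```python
# Hedged sketch of the static marking pass over profiled hit/miss counts.
MISS_THRESHOLD = 0.75    # mark C/NA if the load misses on more than 75% of its references

def mark_cna(profile):
    """profile: dict mapping load PC -> (hits, misses) gathered from a training run."""
    cna = set()
    for pc, (hits, misses) in profile.items():
        total = hits + misses
        if total and misses / total > MISS_THRESHOLD:
            cna.add(pc)              # this load misses too often to be worth allocating for
    return cna
```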
Cache Hit Rate and Memory Bandwidth Utilization
A potentially more meaningful measure of the demands made on the memory system is to determine the total amount of data (in bytes) that must be fetched from the memory system. Since we used a system configuration in our simulations similar to that of the Alpha (32 byte cache lines and single references being 8 bytes), we were able to determine the total number of bytes that the memory system must process and the impact of these C/NA transformations on the bus activity.
We calculated the total bus utilization for the Static case by multiplying the number of allocatable misses by 4 (32/8), and adding the number of references to instructions marked C/NA. Dividing this number by the base case bus utilization allows us to calculate the percentage change in the bus bandwidth needed by the Static approach. The results of these calculations are shown in the last 4 columns of Table III . So, for example, after the C/NA transformations the compress program run on a 16 K-byte 2-way set associative data cache requires 61.62% less bandwidth than that required by the same program run on the same hardware without the C/NA transformations.
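This calculation can be written down directly; the sketch below follows the description above, using the 32-byte line and 8-byte reference sizes from the simulations (the function and parameter names are assumptions):

```python
# Illustrative version of the bus-traffic calculation for the Static scheme.
WORDS_PER_LINE = 32 // 8                 # an allocating miss moves 4 reference-sized words

def static_bus_fraction(alloc_misses, cna_refs, base_misses):
    """Fraction of the baseline bus traffic required after the C/NA transformation.
    base_misses: misses of the unmodified cache, each of which fetches a full line."""
    static_traffic = alloc_misses * WORDS_PER_LINE + cna_refs   # line fills + single-word fetches
    base_traffic = base_misses * WORDS_PER_LINE                 # every baseline miss fills a line
    return static_traffic / base_traffic
```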
The table shows a significant overall decrease in the required memory bandwidth. In particular, it shows that the static scheme used in conjunction with an 8 K direct mapped cache results in an average decrease in bus activity of approximately 30% for both the integer and the floating point programs.
Memory Activity
Another important measure of the effectiveness of this technique is the amount of memory traffic that ensues. This information is shown in Table IV. There are 4 classifications of load instructions shown in Table IV:
Cacheable/Non-Allocatable--those load instructions that have been identified as C/NA.
Increased--load instructions that are cached and have a higher miss rate because of the transformation.
No Change--load instructions that are cached and maintain their original cache hit activity.
Decreased--load instructions that are cached and have a lower miss rate because of the transformation.
In order to reduce the tremendous amount of data generated, we show information that has been averaged over all benchmarks for an 8 K-byte direct mapped cache.
In order to better understand what is happening, imagine a situation where items A and B both map to the same cache line and are repeatedly accessed. In this case each reference will experience a high miss rate. However, by prohibiting one of these items (A, for example) from allocating the cache line on a miss, the remaining item (B) will experience a much lower miss rate due to the elimination of contention. This effect is shown in the Decreased field of the table.
On the other hand, some items with a high miss rate actually perform a useful function by bringing a line into the cache that will later be referenced by other load instructions. By eliminating the cache line allocation of these instructions, the cache hit performance of these other loads is decreased; this is reflected in the Increased field.
The first column of Table IV shows the load instruction classification. The second and third columns show the average number of load instructions and the percentage of the total load references these instructions perform, respectively. The fourth column contains the average number of references to memory (the number of cache misses) that occurred before any loads were marked C/NA. The fifth and sixth columns show the number of memory references after the C/NA transformations were performed and which instructions were responsible for the references.
In Table IV we see that the total number of memory references has increased by over 29%. This is due in large part to the 351% increase in the number of cache misses experienced by 443 of the non-C/NA load instructions. This approach is apparently too aggressive in marking loads C/NA; by blindly removing those loads with poor performance, we are often simply shifting a miss from that instruction to the next instruction referencing that location. Clearly, a more refined approach to marking certain high-miss load instructions C/NA is called for.
Improved Static Method
In order to improve the performance of the simple static technique, the number of instructions marked C/NA had to be reduced. This was accomplished by associating with each cache line the address of the instruction that was responsible for bringing that line into the cache. This information allowed us to distinguish between misses that bring data into the cache that is later referenced (a useful prefetch) and misses whose data is never referenced again before the line is evicted by the cache replacement strategy. Only instructions that do not perform a useful prefetch are marked C/NA. We refer to this as the Improved Static Method.
In our simulations, this modification to the static approach was implemented in the following manner: we used the same 75% threshold to identify potential C/NA instructions. Once these were identified, they were analyzed to determine if they were performing a useful prefetch. If at least 3/4 of an instruction's misses proved to be useful prefetches, the instruction was removed from the C/NA list, resulting in a less aggressive application of C/NA.
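The per-line bookkeeping and the 3/4 prefetch filter might look like the following sketch. The data structures and names are assumptions, since the paper does not spell out the implementation.

```python
# Hedged sketch of the Improved Static filter: each fill records the PC that
# brought the line in; on eviction we note whether the line was ever hit again,
# and a C/NA candidate is kept only if fewer than 3/4 of its misses turned out
# to be useful prefetches.
from collections import defaultdict

fill_pc, line_used = {}, {}          # per cache-line-index bookkeeping
useful = defaultdict(int)            # per-PC: fills that were hit before eviction
wasted = defaultdict(int)            # per-PC: fills that were evicted untouched

def record_fill(index, pc):
    old = fill_pc.get(index)
    if old is not None:              # settle the account of the line being replaced
        (useful if line_used[index] else wasted)[old] += 1
    fill_pc[index], line_used[index] = pc, False

def record_hit(index):
    if index in line_used:
        line_used[index] = True

def final_cna(candidates):
    """Keep only candidates whose misses were mostly not useful prefetches."""
    return {pc for pc in candidates
            if useful[pc] + wasted[pc] > 0
            and useful[pc] / (useful[pc] + wasted[pc]) < 0.75}
```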
Hit Rate and Memory Bandwidth Utilization
As shown in Table V, the Improved Static approach provides hit rates very close to those presented in Table I. Cache performance was only slightly worse for both the integer and floating point benchmarks (on average).
Table V also shows how the Improved Static scheme affects the bus bandwidth: it consistently reduced the memory bandwidth requirements relative to the original Static scheme. This was achieved by reducing the memory requirements for more than half of the cache misses, namely those that did not allocate a new cache line.
In particular, Table V shows that the Improved Static scheme used in conjunction with an 8 K cache results in an average decrease in bus activity of approximately 30%, and of more than 50% for 5 of the programs.
Memory Activity
An examination of the memory activity shown in Table VI reveals several interesting observations. For example, the number of instructions in the C/NA class dropped from 351 to 187, indicating that many instructions with high miss rates are actually performing useful work (prefetching). As one might expect, the increase in memory activity due to the C/NA instructions dropped as well. However, the most dramatic change is in the number of instructions that have their miss rate increase: this drops from 443 to 307, reducing the corresponding increase in memory activity from 351% to 62%.
The most significant number in Table VI is the total change in memory activity. This shows that by applying the improved Static method to a program the cache hit rates can be maintained while simultaneously decreasing the amount of traffic to memory.
DYNAMIC CACHE MODEL
It is clear that the use of the improved static approach will improve data cache performance. However, the static approach requires training runs of the program and the introduction of new instructions in order to specify the alternate cache operation. Both of these factors markedly decrease the applicability of this approach. Our goal is to develop a scheme that will provide the same performance enhancement transparently. In order to select which items should be marked C/NA, we turn to the body of work on branch prediction strategies. A great deal has been written about branch prediction strategies recently. (11-17) Briefly, dynamic branch prediction strategies collect runtime information about branch behavior to predict whether a branch will be taken in the future. Typically, these strategies associate several bits of information with a branch instruction that track the past history of the branch instruction. This information is updated each time the branch instruction is executed and is used to make a prediction about the branch instruction's behavior.
In a similar way, several bits can be associated with a load instruction. A table, similar to a branch prediction table, can be maintained which tracks whether the data referenced by a load instruction caused a miss in the data cache. This information can then be used to decide whether an instruction should be marked C/NA.
Dynamic 2-bit Counter Scheme
In our study, we simulated miss prediction tables using a 2-bit counter associated with each load instruction. A miss prediction counter is initially set to zero and it is incremented each time a load causes a cache miss. If the load instruction causes a cache hit, the counter is decremented. When the counter enters its highest state ("11"), the instruction is marked C/NA.
It is worth stressing again that the counters simply inform the cache allocation hardware whether the data should be placed in the cache on a miss. Regardless of the state of the counters, a data cache lookup is performed on every data reference, since the data may have been brought into the cache by some other instruction. Thus, there is the possibility that in one phase of program execution an instruction will be prevented from caching its data, but in a different phase of the program it will be allowed to do so.
To develop a more transparent scheme, we looked to previous work done in branch prediction. In Ref. 18, Chang and Patt use a counter-based scheme to choose the best performing scheme among different branch predictors. We use this same approach to determine whether the reference pattern for a load instruction is being captured by the cache. For each load instruction, a 2-bit state entry captures the recent cache hit behavior for that instruction. The 2-bit entry specifies one of 4 states (numbered 1 to 4) representing recent cache accesses. States 1 through 3 represent a recent cache state in which some references were found in the cache; state 4 represents the situation where few cache hits are occurring. The state is modified by each reference; a cache hit decreases the state number, while a cache miss increases it.
The cache placement strategy is then modified to allocate a cache line on a miss only when the instruction that missed is in state 1, 2, or 3. This means that load instructions that have had a cache hit rate of 25% or more over the last few references will allocate a cache line on a miss. If recent cache references for this instruction have all been misses, then no allocation will occur. This differs from the static method because poor cache performance does not have to persist throughout the execution to mark an instruction C/NA (the load instruction's status, whether it is C/NA or not, can change during the execution of the program). Since the C/NA marking is maintained as part of the miss prediction table, it does not require new types of instructions to be added to the architecture, as would be the case with a static scheme.
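A minimal sketch of the counter update and allocation decision follows. Counter values 0 through 3 correspond to the four states above, with 3 being the "11" C/NA state; the unbounded per-PC dictionary mirrors the initial, unlimited-table experiments, and the names are illustrative.

```python
# Sketch of the dynamic 2-bit miss prediction counter (saturating at 0 and 3).
miss_counter = {}                    # load PC -> 2-bit saturating counter, initially 0

def update_and_allocate(pc, hit):
    """Decide whether a miss by this load may allocate, then update its counter."""
    c = miss_counter.get(pc, 0)
    allocate = c < 3                 # states 0-2 allocate; state 3 ("11") is C/NA
    miss_counter[pc] = max(c - 1, 0) if hit else min(c + 1, 3)
    return allocate
```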
Experiments were performed using 2-bit counter miss prediction schemes. Initially the size of the miss prediction table was unlimited in order to evaluate the ability of the 2-bit scheme to track a hit/miss history. In later runs the size of the miss prediction table was fixed.
Table VII summarizes the average memory reference activity when using 2-bit counters for miss prediction on the SPEC benchmarks. As in Tables IV and VI for the static schemes, the results are averaged across all the benchmarks for an 8 K-byte direct mapped cache configuration. Unlike the results for the static schemes, the C/NA instruction classification is broken down into three categories. This is necessary because with a dynamic scheme an instruction can be in the C/NA state only part of the time. The three categories are: (1) <5 C/NA, the instruction was in the C/NA state for less than 5% of its references, but for at least one reference; (2) 5-95 C/NA, the instruction was in the C/NA state for between 5% and 95% of its references; and (3) >95 C/NA, the instruction was in the C/NA state for 95% or more of its references. Another difference in these tables is the separation of the post-transformation misses into two types: those misses that do not cause a cache line replacement (because the load instruction is in the C/NA state), and those misses that do cause a line replacement.
Looking at the results shown in Table VII, we first note that the number of instructions that spend some amount of time in the C/NA state is much larger than for either of the static methods. This is seen by comparing the first line (C/NA) of Tables IV and VI with the first three lines of Table VII. Clearly the dynamic behavior of the program has a significant impact on whether the data item for a particular load instruction will be found in the cache. Further comparisons between the static and dynamic results indicate that, as one might expect, the 2-bit dynamic scheme is moving instructions from the Increased, No Change, and Decreased categories into one of the C/NA categories. Overall, this shift increased the average number of memory references by 92.15%.
Improved Dynamic 2-bit Counter Scheme
As with the first static scheme, the 2-bit miss prediction scheme is too aggressive in classifying instructions as C/NA. Marking an instruction as C/NA too quickly results in the large 92% increase in memory references. As a next step, we modified the 2-bit scheme to mimic the Improved Static scheme. In the Improved Dynamic scheme, each line of the cache has associated with it the address of the load instruction that brought that line into the cache. On a cache hit, the 2-bit counter associated with the instruction that caused the hit is decremented, and in addition the 2-bit counter associated with the instruction that brought the cache line into the cache is also decremented. Thus, those instructions that do useful prefetching of data for other instructions are not marked as C/NA. Results of simulations using the Improved Dynamic 2-bit miss prediction table are shown in Table VIII.
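A self-contained sketch of this modified update is shown below; the names are again illustrative, with fill_pc recording which load brought each line into the cache.

```python
# Sketch of the Improved Dynamic update: a hit decrements the counter of the
# load that hit and also of the load that originally filled the line, so that
# useful prefetchers are not driven into the C/NA state.
miss_counter = {}                    # load PC -> 2-bit saturating counter
fill_pc = {}                         # cache line index -> PC that filled the line

def bump(pc, delta):
    miss_counter[pc] = min(max(miss_counter.get(pc, 0) + delta, 0), 3)

def on_access(pc, index, hit):
    if hit:
        bump(pc, -1)
        filler = fill_pc.get(index)
        if filler is not None and filler != pc:
            bump(filler, -1)                     # credit the load that prefetched this line
    else:
        allocate = miss_counter.get(pc, 0) < 3   # decide before updating the counter
        bump(pc, +1)
        if allocate:                             # not in the C/NA ("11") state: allocate
            fill_pc[index] = pc                  # remember which load brought the line in
```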
As can be seen in Table VIII, the number of instructions that are placed in the C/NA category is much smaller when compared with those in Table VII. This reduces the number of memory references to the point that there is actually a 1.36% decrease compared to a conventional cache. The small change in the number of memory references and the small number of instructions in the C/NA categories indicate that perhaps this improved strategy is too conservative in marking instructions whose data should not be cached.
Table IX provides a summary of the results of an analysis of the memory bandwidth requirements of the dynamic schemes for each of the SPEC benchmarks. This analysis accounts for transferring an entire cache line from memory on a cache miss as well as for referencing data items that will not be cached. The data in the table give the percentage of memory bandwidth required compared to a conventional cache scheme that does not use a miss prediction table. The columns of the table show the average memory bandwidth required for 8 K-byte and 16 K-byte direct mapped and 2-way associative caches using the 2-bit dynamic and Improved Dynamic strategies.
The results in Table IX indicate that the bandwidth requirements of the dynamic schemes are not reduced as substantially as with the static schemes. This makes sense since with the static schemes, we have more information when marking which instructions should be C/NA. Nonetheless, for most programs the bandwidth requirements are reduced, and in several cases the reductions are substantial. Furthermore, the data in Tables VII-X indicate the trade-offs between caching data items and the resultant bandwidth requirements. With the more aggressive dynamic strategy, where more instructions are marked C/NA, there is more memory activity. However, the memory activity is for a single data item instead of an entire cache line. Thus, there is a reduction in the required memory bandwidth. On the other hand, with the Improved Dynamic strategy, there is less memory activity, but the required memory bandwidth is higher than the simple dynamic scheme (though still less than the requirements of an unmodified cache) since the memory activity involves more fetches of entire cache lines.
Impact of Fixed-Size Miss Prediction Table
The next set of experiments that we performed involved fixing the size of the miss prediction table. For this set of experiments we looked at a direct mapped cache of 8 K-bytes using the first dynamic prediction strategy. The miss prediction table was fixed at 4-way set associative, while the table size was varied. The results of these experiments are summarized in Fig. 1 for the initial dynamic prediction scheme.
Fig. 1. Performance of a fixed-size miss prediction buffer using 2-bit dynamic prediction.
In Fig. 1, we have plotted the table size on the horizontal axis, while the hit rate in the table and the resultant bandwidth requirements are plotted on the vertical axis. As can be seen in the center of the figure, a miss table of 256 entries reduces the average memory bandwidth requirements to a value very close to what one would get with an infinitely large miss prediction table.
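The organization of such a fixed-size table might look like the sketch below. The PC-based indexing, the LRU replacement within a set, and the entry format are assumptions, since the text only specifies a 4-way set associative table of varying size.

```python
# Hedged sketch of a fixed-size, 4-way set associative miss prediction table
# indexed by load PC. Counter updates would follow the 2-bit scheme above.
class MissPredictionTable:
    def __init__(self, entries, ways=4):
        self.ways = ways
        self.sets = entries // ways
        self.table = [[] for _ in range(self.sets)]   # each set: list of [tag, counter]

    def counter_for(self, pc):
        """Return the 2-bit counter for this PC, allocating an entry on a table miss."""
        s, tag = pc % self.sets, pc // self.sets
        for i, entry in enumerate(self.table[s]):
            if entry[0] == tag:
                self.table[s].insert(0, self.table[s].pop(i))   # promote to MRU
                return entry[1]
        if len(self.table[s]) == self.ways:
            self.table[s].pop()                                 # evict the LRU entry
        self.table[s].insert(0, [tag, 0])                       # new entries start at zero
        return 0
```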
OPTIMAL PERFORMANCE OF C/NA
Our experiments have shown that modifying the cache replacement policy has great potential for reducing bus bandwidth requirements with a minimal impact on cache hit rate. How close are we to achieving the optimal performance of the technique? We ran another set of experiments in order to try and answer this question.
In order to calculate the optimal performance of this scheme, we ran experiments in which we assumed full knowledge of the future data reference pattern. (We essentially used Belady's optimal page replacement algorithm (19) on cache lines instead of virtual pages.) For standard n-way replacement schemes, we used this information to decide which of the n elements to replace. (Since these are standard replacement schemes, one of the elements did have to be removed.) We called this the Must Replace (MR) scheme. The C/NA technique removes this restriction; in this case, the decision was whether or not to replace an element, and if so, which one. Table X shows the optimal cache hit rates for 4 different cache configurations, assuming the same cache parameters as before (32-byte lines, 4-byte words, write through, no-allocate on write miss). The Standard LRU column is for a standard LRU replacement strategy, which is included for comparison purposes. The Optimal MR column contains data for the Must Replace strategy, in which one of the items in a set must be replaced when processing a cache miss. The Optimal C/NA column shows the data for the C/NA scheme with full knowledge of the future. As Table X shows, it is possible in the ideal case for the C/NA strategy to actually increase the cache hit rate, especially for direct mapped caches. In our experiments, our implementation of the C/NA strategy did not accomplish this. However, Table X does show that in the optimal case a direct-mapped cache using C/NA has a hit rate that exceeds that of a 2-way set associative cache using LRU and approaches that of a 2-way set associative cache with optimal replacement or a 4-way set associative cache using standard LRU. Comparing this table with Tables VI and IX, we also see that while the improved static scheme comes fairly close to the limit of bandwidth reduction, the dynamic schemes are not yet approaching the ideal.
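One plausible greedy formulation of the optimal C/NA decision for a single (full) set is sketched below; next_use is an assumed oracle giving the position of the next reference to a line in the remaining trace (infinity if it is never referenced again), and the function name is illustrative.

```python
# Hedged sketch of the optimal decision: Belady replacement extended with the
# option not to allocate at all (the C/NA case).
def optimal_cna_fill(resident_lines, missing_line, next_use):
    """resident_lines: lines in a full set; missing_line: the line that missed.
    Returns the line to evict, or None to leave the set unchanged (do not allocate)."""
    victim = max(resident_lines, key=next_use)       # Belady: line referenced furthest away
    if next_use(missing_line) < next_use(victim):
        return victim                                # allocating helps, so replace the victim
    return None                                      # the missing line is needed later than
                                                     # anything resident: do not allocate
```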
CONCLUSIONS AND FUTURE WORK
In this work, we have investigated the potential for improving average data access time by being more selective in what data items are cached. This work was motivated by the apparent limitations in the size, organization and speed of first level data caches. To make the data cache smarter with respect to the items it caches, we first examined and analyzed which instructions generated data cache misses. In this analysis, we confirmed and expanded on the results of other work that indicates a very small number of instructions are responsible for a very large percentage of data cache misses.
Based on this observation, we analyzed the impact on cache and memory system performance if certain data items were not cached. In the first part of our simulation studies, we determined whether an instruction's data item should be cached by performing a static analysis of program behavior. The results of these studies indicate that the amount of memory activity, and hence the required memory bandwidth, could be substantially reduced by not caching all data items. Since this static analysis requires executing the entire program and marking which instructions should have their data cached, we then looked at schemes that could dynamically detect which data items should be cached. The dynamic schemes we investigated are based on 2-bit branch prediction schemes. Instead of a branch prediction table, we have a miss prediction table that holds a 2-bit counter associated with load and store instructions. We investigated two 2-bit miss prediction strategies. Both of these strategies offered a reduction in a program's memory bandwidth requirements. However, neither dynamic scheme performed as well as the improved static scheme.
We then calculated the optimal performance that could be attained by the various cache configurations and by the C/NA replacement scheme, in order to evaluate the potential effectiveness of this approach. Without doing this study, it is virtually impossible to ascertain the amount of effort that should be put into refining and extending the initial studies. Our investigations revealed that in the ideal case the C/NA strategy can provide both a small improvement in cache hit rate and a substantial decrease in necessary bus bandwidth. In fact, in the optimal case, a direct-mapped cache using C/NA will require approximately 60% of the bandwidth of a cache not using C/NA, and the hit rate will exceed that of a 2-way set associative cache using LRU and approach that of a 2-way set associative cache with optimal replacement. These results imply that this work has the potential to significantly impact processor cache designs in the future, and should be continued.
We have performed a preliminary study of the feasibility of incorporating a hardware-based speculative prefetch unit to extend this work. Caches work well in exploiting the spatial and temporal locality of certain data references, but fail when locality is missing. Prefetch works well when there is regularity in the access pattern regardless of locality. By incorporating a hardware prefetch unit for C/NA items, it may be possible to hide the latency of even those loads that have little locality.
Another possible application of a dynamic scheme similar to the one described in this paper involves dynamically configuring a cache coherence protocol to fit the requirements for each load instruction; instructions that are likely to share data could use a different protocol from those that access local data.
We believe that using a method of dynamic configuration of cache operations like the one described in this paper can have broad applicability. Similar schemes can not only improve the performance of the cache, but can allow for other hardware based memory enhancements to be selectively applied.
