Sector caches use an address tag to identify a sector, and valid bits to indicate whether each subsector is present in the cache. This design permits a small number of tag bits to control a large number of data bytes in the cache. Such a design is useful for single level caches when cache sizes are small and/or when the optimal line sizes are small. Sector caches are most useful when small first-level tag arrays are used to control large second-level cache data arrays. The problem with sector caches is that, because of the constrained mapping between address tags and data array space, many of the subsectors are frequently not filled before the sector is replaced.
Introduction
On-chip caches are using an increasing portion of die space on recent processors. Simultaneously there has been a disproportionate increase in the overhead of tag bits used to manage the buffer space in the cache due to the increase in the address space [WSY95, Sez94, Sez97]. One approach to minimizing the ratio of tag to data bits is to use a sector cache [Lip68, HS84, Prz90, Sez94, Sez97, RS99a].

*Partial funding for this research has been provided by the State of California under the MICRO program, Sun Microsystems, Fujitsu Microelectronics, Toshiba, Microsoft, Cirrus, and Quantum.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS '99 Rhodes Greece. Copyright ACM 1999 1-58113-164-x/99/06...$5.00
Figure 1: Symbolic view of a sector frame in a unified or data sector cache. Several subsectors are associated with one address tag. Each subsector has a dirty bit (D) and a valid bit (V). Each subsector can be independently present (V bit is on), invalid (V bit off), dirty (D bit on), or clean (D bit off).
A cache's function is to keep frequently accessed portions of main memory close to the processor in a high speed buffer in order to reduce the delay of accessing memory. When a reference to memory does not find a copy in the cache (cache miss), a number of bytes are fetched from memory in a group, which can be referred to as a cache line or block. Address tags are associated with each cache line to keep track of the original location of the line in main memory. Caches are successful due to the observed property of memory reference locality, which takes two forms. Temporal locality is observed when the same location is accessed repeatedly, such as instructions in a loop. Spatial locality occurs when words neighboring a recently accessed memory location are also referenced. Data arrays and the stream of instruction references in general show this property.
Sector caches, which were the first commercial form of cache, were introduced with the IBM System/360 Model 85 [Lip68] .
Sector caches differ from normal caches by allowing a number of cache lines (subsectors) to be associated with each address tag (Figure 1). As address tag sizes increase, sharing a single address tag among several subsectors can permit a significant savings in cache management bits. However, this savings comes at the cost of reducing the flexibility of mapping subsectors into the cache, generally causing an increase in the miss ratio. Each sector (address tag) has a power-of-two number of subsectors associated with it, each at a fixed offset within the sector frame (the cache space set aside for all the subsectors belonging to a sector). Each subsector can be independently present or absent while its sector is in the cache.
Note that we use the terminology "sector" and "subsector".
In [Lip68], these are known respectively as "sector" and "block." In [Goo83] these are called "address block" and "transfer block." In [HS84] these are called "block" and "subblock." We believe that the sector/subsector terminology is the clearest and least verbose. For normal, non-sectored caches, we refer to the unit of data transfer and addressing as either a "block" or a "line."
When appropriate, we call a "sector" a "sector frame." The subsector is also called the "fetch size."
Various studies have shown sector caches to be useful. Sector caches can be used to reduce bus traffic with only a small increase in miss ratio, by using small subsectors [Goo83]. In some cases, particularly when the cache is small, the line size is small and/or the miss latency is small, sector caches improve performance in computers with single level caches [RS99a, Prz90]; some recent processors use designs with two subsectors per sector [Moo93, TO96].
Sector caches more frequently yield benefits in two-level cache systems in which tags for the second level sector cache are placed at the first level, thus permitting a small tag store to control a very large cache [RS99a] .
The problem with sector caches is that the mapping between sectors and subsectors is constrained, with two negative results.
First, in most cases a sector is replaced before all of its subsectors are occupied with data, causing the cache space to be used inefficiently. Second, when a sector is replaced, all of its subsectors must be replaced, even though only a single subsector is being fetched. For example, Hill [HS84] estimated that 72% of the data in the cache lines had not been touched, and [RS99a] found that the number of (four byte) words per sector not accessed while in the cache ranged from 6% to over 90%, depending on sector and cache size (reproduced in Table 1). Table 1 shows the fraction of subsectors accessed between the time a sector is read into the cache and the time it is evicted from the cache, demonstrating that for many cache configurations, a significant number of words or subsectors are not touched while a sector resides in the cache. Traditional sector caches are partially able to take advantage of this property by not loading portions of a sector which are not demanded, saving memory bus bandwidth.
However, the sector cache still underutilizes its data space, since each subsector space in the cache is reserved for exclusive use by its corresponding sector frame. What we propose in this paper is a more dynamic form of sector cache, which allows the sector frames (address tags) in each set to compete for subsectors. Less data space is used by the cache, which provides similar performance with less memory allocated to the data storage arrays.

In the sector pool organization, subsectors are associated with a set of sectors, rather than with individual sectors (Figure 2). Assume an N-way set-associative sector cache, i.e., the sectors are organized in a set-associative manner, and each sector consists of subsectors.
In a "normal" design, each sector would contain K subsectors, and we could diagram the set as N rows, each consisting of an address tag (denoting the sector) and K subsectors (with valid bits). The set would thus contain N*K subsectors. Since we've observed that many subsectors never get used, in the sector pool design each column of subsectors contains fewer than N elements, and the subsectors are dynamically assigned to the sectors as necessary. The hope is that we can save a significant amount of the data space without significantly increasing the miss ratio. An example sector pool design (Figure 2) uses a 4-way set-associative configuration with 4 subsectors per sector and pool depth 3. There are 3 assignable subsectors for each subsector position [0..3]. This leads to a total of 12 assignable subsectors for the whole set, rather than the 16 subsectors that are required by a traditional sector cache.

The implementation of the sector pool cache is straightforward. Figure 3 shows the subsector list pointer bits associated with a single sector address tag. This example contains 8 sectors in the set (one sector is shown), which must compete among themselves for 5 subsectors at each offset. Each offset within the sector frame contains pointer bits to a position within the list of subsectors associated with a set. A traditional 8-way set-associative sector cache effectively contains 8 subsectors per list; this example sector pool cache uses 5 subsectors in each list. Six valid values are maintained by the pointer bits: 0 for an invalid subsector, and 1 through 5, which are indices into the corresponding subsector list.
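As an illustration of the pointer scheme described above, the following Python sketch models a single set of a sector pool cache. It is a minimal model under stated assumptions, not the simulator used for this paper: the class name `SectorPoolSet`, the `owner` table used to find free slots, and the `access` method are hypothetical names introduced here. When a subsector list is full, the sketch assumes the victim is the least recently used sector that currently holds a subsector at that offset, which is one reading of the dependent LRU policy discussed under Replacement Strategy.

```python
class SectorPoolSet:
    """One set: `ways` sector frames share `depth` slots per subsector offset."""

    def __init__(self, ways=8, subsectors=4, depth=5):
        self.ways, self.subsectors, self.depth = ways, subsectors, depth
        self.tags = [None] * ways  # sector address tags
        # ptr[w][k] == 0 means invalid; 1..depth index the list for offset k.
        self.ptr = [[0] * subsectors for _ in range(ways)]
        # owner[k][s] is the way using slot s+1 of offset k (assumed helper
        # for free-slot lookup; the paper does not describe this structure).
        self.owner = [[None] * depth for _ in range(subsectors)]
        self.lru = list(range(ways))  # most recently used first

    def _touch(self, way):
        self.lru.remove(way)
        self.lru.insert(0, way)

    def _release(self, way, k):
        slot = self.ptr[way][k]
        if slot:
            self.owner[k][slot - 1] = None
            self.ptr[way][k] = 0

    def access(self, tag, k):
        """Return True on a subsector hit; on a miss, install the subsector."""
        if tag in self.tags:
            way = self.tags.index(tag)
        else:
            # Sector miss: replace the LRU sector and free all its subsectors
            # by walking its own pointers (no search of the data array).
            way = self.lru[-1]
            for j in range(self.subsectors):
                self._release(way, j)
            self.tags[way] = tag
        self._touch(way)
        if self.ptr[way][k]:
            return True
        if None not in self.owner[k]:
            # List full: evict from the LRU sector holding this offset.
            victim = next(w for w in reversed(self.lru) if self.ptr[w][k])
            self._release(victim, k)
        slot = self.owner[k].index(None)
        self.owner[k][slot] = way
        self.ptr[way][k] = slot + 1
        return False
```

With `ways=4`, `subsectors=4`, and `depth=3` this corresponds to the Figure 2 example: four resident sectors compete for three slots at each offset, so a fourth sector touching the same offset displaces a subsector of another sector while that sector's tag remains resident.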
Terminology
The following is a list of terms used to describe sector pool cache configurations:

• Depth: the fixed number of subsectors reserved by the set for each particular subsector offset. The depth quantifies the number of assignable subsectors available in a list for a particular configuration.
Related Work
Sector caches with a dynamic assignment of subsectors to sectors have been previously proposed.
The purpose of these sector cache organizations was to improve on traditional sector caches by reducing the number of address tag bits while minimizing the impact on miss ratio. This was achieved by increasing the flexibility of mapping subsectors to sectors.
In [Sez94, Sez97], an organization was proposed in which address tags (sectors) were grouped together to compete for subsectors.
For each subsector position, only one sector out of N in the grouping can use that particular subsector offset at a time. Each subsector maintains a selector tag, which allows it to determine the sector to which it belongs. The grouping of sectors that share subsectors can be unrelated (direct mapped) or belong to a set. The sectors and subsectors can have independent varying degrees of set-associativity.
In the simplest case, the subsectors are accessed just like the blocks in a direct mapped cache, using some of the upper address bits as an index into the cache. The sector address tags (for example, kept in pairs) are accessed using a smaller subset of the upper address bits. Associated with each subsector is a valid bit and a selector bit which points to one of the pair of sector address tags. Once the selector bit has been read and the sector address tag has been obtained, it is possible to determine whether the subsector access was a hit or a miss.
When compared to normal caches of the same associativity, the decoupled sector cache was reported to have worse behavior, with up to a 337 percent increase in the miss ratio, although with the increase falling mostly in the 20 to 80 percent range for the larger workloads. Increasing the set-associativity for the sectors and (independently) the subsectors was shown to improve the performance of this type of organization.
For large caches (256K and 1M bytes), the miss ratio (using 32 byte subsectors) showed reasonable performance with some workloads, while reducing the number of tag bits by up to 86 percent. However, some of the more robust workloads (such as doduc and gcc) showed much worse miss ratios (2 to 11 times higher) for certain configurations. The best overall performance generally occurred when the set-associativity of the tag array was close to the number of sector tags competing for each subsector. Invalidations appear to be difficult and slow in this organization, due to the necessity of querying all the subsectors (up to 64 in some cases) in a region of the cache to determine which ones to evict when a sector is replaced (see discussion below in Section 3.4).
An even more dynamic organization was described and evaluated in [WSY95, WSY97] . The authors noted that many address tags in a cache contain the same values, allowing the possibility of reducing tag space by creating a small address tag array (sector tags), with each cache line maintaining a pointer into the sector tag array. Like a sector cache, when an address tag is removed from the tag array, all associated cache lines (which behave like subsectors) must also be removed from the cache. What makes it different from a normal sector cache is that an arbitrary number of subsectors can be associated with each sector. The results show that for a 16K byte cache using 16 byte lines, a tag array with 32 entries has similar performance to a normal cache, but with a substantial savings in tag space. However, one important issue with this type of organization is the associative search required to remove subsectors on sector replacement.
Methodology
This research used trace driven simulation to derive miss ratio results for each point in the design space. We examined a large range of cache parameters, varying the cache size, the number of subsectors per sector, the size of subsectors, the set-associativity, and the subsector pool depth associated with each set. To properly evaluate our design, we consider miss ratios and mean memory delays as a function of the number of bits required by each cache organization to measure the cost and benefit. Our cache simulations used 24 separate traces, combined in a multiprogramming workload, as described below. A summary of each of the trace characteristics can be found in Table 2. These programs are a combination of engineering and scientific applications, SPEC92 integer and floating point workloads, a few commonly used Unix utilities, and two traces of an IBM System/370 containing user and operating system memory references to make the traces more realistic. A fairly detailed discussion of the traces appears in [RS99a, RS99b].
Simulated Multiprogramming
An examination of Table 2 shows that in many cases the number of bytes accessed by these workloads is insufficient to cause very many misses from caches larger than 16K bytes (particularly instruction caches), let alone provide any meaningful results for the maximum size of caches under test (512K bytes). In addition, it was noted in [Smi85] that the usual method of exclusively utilizing user space references results in over-optimistic estimates of performance for several reasons, including: 1) task switching in real systems causes all or part of the cache to be flushed or overwritten, making the performance of the memory system worse than would be predicted from uniprogrammed trace analysis, and 2) OS code typically has much worse cache access characteristics, with much higher miss ratios.
To address these issues, we combined all the traces into a single trace which simulates multiprogramming, using task switching probability information observed in real systems [Kob86]. Our process is as follows: the LRU stack model is used to model process scheduling; the LRU stack consists of all of the processes in the system, ordered from most recently executed to least recently executed.
At the time of a task switch, a new task is selected using the stack program locality probabilities derived in [Kob86], "run" for a time quantum, and then pushed back on the top of the stack. The time quanta have an exponential distribution with a mean of 20,000 instructions executed. "Running" a process means using sequential memory addresses from that process. Each address is tagged with the process ID, so the cache is not flushed on a task switch. This process is repeated until all the traces have been exhausted.
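The interleaving procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interleave_traces` and its parameters are hypothetical, the stack-depth probabilities from [Kob86] are not reproduced in this paper and must be supplied by the caller as `stack_probs`, and the quantum is counted in trace references rather than in instructions executed.

```python
import random


def interleave_traces(traces, stack_probs, mean_quantum=20_000, seed=0):
    """Yield (pid, address) pairs from per-process traces.

    traces:      dict mapping pid -> list of addresses (assumed input format)
    stack_probs: stack_probs[d] is the relative probability of selecting the
                 process at LRU stack depth d (placeholder for [Kob86] data)
    """
    rng = random.Random(seed)
    stack = list(traces)               # pids, most recently run first
    cursor = {pid: 0 for pid in traces}
    while stack:
        # Pick a stack depth according to the supplied probabilities.
        weights = stack_probs[:len(stack)]
        depth = rng.choices(range(len(stack)), weights=weights)[0]
        pid = stack.pop(depth)
        # Exponentially distributed quantum, at least one reference long.
        quantum = max(1, round(rng.expovariate(1.0 / mean_quantum)))
        start = cursor[pid]
        end = min(start + quantum, len(traces[pid]))
        for addr in traces[pid][start:end]:
            yield pid, addr            # address tagged with its process ID
        cursor[pid] = end
        if end < len(traces[pid]):
            stack.insert(0, pid)       # back on top of the LRU stack
```

A process whose trace is exhausted is simply not pushed back, so the loop terminates once every trace has been consumed, and the references of each process keep their original order.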
This multiprogrammed trace has two major advantages: it displays realistic multiprogramming behavior, and it references a large enough address space to fully exercise the caches under test. The last entry of Table 2 shows some of the characteristics of the multiprogrammed trace. This new trace references more than 9 MBytes of address space, which is sufficient to fully exercise any cache design under consideration in this paper. Figure 4 shows the miss ratios for unified (combined instruction and data) normal (non-sectored) caches over a range of cache and block sizes. It demonstrates that the multiprogrammed workload succeeds in providing a reasonable set of references that have enough capacity misses to properly evaluate caches up to 512K bytes in size. The trend of improving miss ratios with cache size shows little sign of saturation for the largest caches under examination.
A comparison of these miss ratios shows that they fall within the range of values established by other research. The values here are lower than the design target miss ratios for 8-way set-associative caches generated in [HS89]: generally about 50 percent of the value for unified caches and a little less than 50 percent for instruction caches, but very similar values for data miss ratios. Compared with the miss ratios for the SPEC92 workloads generated in [GHPS93], our instruction cache miss ratios are 50 to 100 percent higher, the data miss ratios are 30 to 50 percent lower, and the unified cache miss ratios have generally the same values.
Metrics
In this paper, we present two principal metrics for cache performance. Cache miss ratio is the normal metric, but in systems such as the ones we consider, with varying fetch sizes and multiple levels, not all misses are equal; some take much longer than others. Therefore, we also consider average memory access time, which consists of the cache hit time plus delays associated with cache misses.
Design Parameters and Resources
An important issue in evaluating the effectiveness of the design of any particular cache organization is the amount of resources required to implement the design in hardware.
Such a metric could be the number of transistors, the number of bits, the die area required to implement the design, etc. For simplicity, we use the total number of address tag and data bits. A more detailed evaluation would involve measuring the use of silicon for comparators, data lines, multiplexors, etc.; a study at that level is beyond the scope of this paper.
Calculating the number of data bits for a particular design is straightforward.
We also need to consider the number of bits needed for address tags, valid and dirty bits, time-stamping for replacement, and to point from sector tags to subsectors. Since we are focusing on future architectures, we assume a 48-bit address space, which we believe is reasonable for the newest generation of 64-bit machines.
Replacement Strategy
One influence on the performance of a cache as well as on the necessary bit resources is the choice of replacement policy. In the case of a sector pool cache, replacement is an issue both for the sectors and the subsectors in a set, rather than just the sectors (blocks) in normal caches. One possibility we investigated used LRU (least recently used) replacement independently for sectors and for each list of subsectors in a set, as well as a scheme which made subsector replacement dependent on the set's LRU information for the sector address tags. The results showed that both schemes had very similar performance, with the dependent LRU subsector replacement policy behaving very slightly better. Allowing the subsectors to use the set (sector) LRU information for subsector eviction requires fewer bits (no LRU bits for each list). The dependent subsector eviction works by determining which sector frame had any of its subsectors referenced the longest ago, (i.e., the LRU sector) and evicting the corresponding subsector within the pool needing replacement. This may take a few cycles to determine, but since write-back of a dirty subsector can be handled in the background while stalling for the fetch that caused the subsector eviction, it is not time-critical.
Our calculation of bits for each type of organization uses the dependent subsector LRU design.
The trace driven simulations used a true LRU replacement policy for sectors in each set. Such a policy can in theory be implemented with $\lceil \log_2(n!) \rceil$ bits, and efficiently with $n(n-1)/2$ bits, where n is the number of items to maintain in order. An approximation to LRU for 8-way set-associativity is that used in the IBM 370/168-3, which requires only 10 bits per set. The eight sectors are divided into four pairs, using one bit to maintain LRU ordering within each pair. True LRU order is then kept among the four pairs, which requires six bits. We assumed 10 LRU bits per set for 8-way set-associative caches.
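The bit counts in this paragraph can be checked with a short calculation. The sketch below is illustrative only; the function names are hypothetical, and the pairwise count of n(n-1)/2 bits is the standard one-bit-per-pair encoding, which agrees with the six bits quoted for ordering four pairs.

```python
from math import ceil, factorial, log2


def lru_bits_minimum(n):
    """Information-theoretic minimum: enough bits to name one of n! orderings."""
    return ceil(log2(factorial(n)))


def lru_bits_pairwise(n):
    """One bit per unordered pair of items, recording which was used later."""
    return n * (n - 1) // 2


def lru_bits_paired_approx(n):
    """Approximation described in the text: one bit inside each pair of
    items, plus true (pairwise) LRU order among the n/2 pairs."""
    pairs = n // 2
    return pairs + lru_bits_pairwise(pairs)


for n in (2, 4, 8):
    print(n, lru_bits_minimum(n), lru_bits_pairwise(n), lru_bits_paired_approx(n))
```

For n = 8 the three functions give 16, 28, and 10 bits respectively, the last matching the 10 LRU bits per set assumed in the bit calculations.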
Set-Associativity and Pool Depth
Three possible set-associativities (2,4,8-way) were tested for the sector pool cache design. The results we present here are for 8-way, because of the range of sector pool designs that can be evaluated; 8 or larger degrees of associativity have been used in various mainframes, but are uncommon in microprocessor based systems. Various clever schemes exist to yield good performance despite the delays associated with high levels of associativity [PHS98] . The IBM 3033 used 16-way associativity [Smi82] .
Since subsectors are associated with the set instead of sectors in this design, pointers need to be maintained to indicate where the subsector is found in the set's list of subsectors. This requires $\lceil \log_2(\text{depth} + 1) \rceil$ bits for each subsector in each sector frame; one value is used to indicate that the subsector is absent (not valid), and the other values are used as a pointer to a location in the subsector list (Figure 3). When the cache is fully populated (i.e., the pool depth is equal to the associativity), only a single pointer bit is required (it becomes the valid bit), because the cache is a traditional sector cache and has subsectors at fixed positions for each sector.
3.3 Tag Bit Calculations

In the tag bit calculation, $Bits_{LRU}$ is the number of bits needed to implement an LRU strategy (10 per set for 8-way set-associativity), and $Bits_{Dirty}$ is 0 for instruction caches or 1 for data and unified caches. In the case of a regular sector cache, $Bits_{ptr}$ is equal to 1 and represents the valid bit.
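The following Python sketch shows one way these terms could be combined into a total storage figure. The combination is an assumption made here, not a formula taken from the text: each set is charged the LRU bits plus, for every sector frame, one address tag and a pointer (or valid) bit and a dirty bit per subsector, with a 48-bit address. Under that assumption the sketch reproduces the two totals quoted for the single-level example in this paper (33.23K bytes for a normal 32K cache with 128-byte blocks, and 22.83K bytes for a 32K nominal sector pool cache with 64-byte sectors, 32-byte subsectors, and depth 5). The function name `total_bytes` and its parameters are hypothetical.

```python
from math import ceil, log2


def total_bytes(nominal_bytes, sector_bytes, subsectors, depth,
                ways=8, address_bits=48, lru_bits=10, dirty_bits=1):
    """Data plus tag/control storage, in bytes, for one cache configuration.

    A normal cache is modeled as subsectors=1 with depth == ways.
    lru_bits=10 is the per-set figure assumed in the text for 8-way caches.
    """
    sets = nominal_bytes // (sector_bytes * ways)
    # Tag = address bits left after the sector offset and the set index.
    offset_bits = sector_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    tag_bits = address_bits - offset_bits - index_bits
    # Fully populated pool: the pointer degenerates to a single valid bit.
    ptr_bits = 1 if depth == ways else ceil(log2(depth + 1))
    sector_bits = tag_bits + subsectors * (ptr_bits + dirty_bits)
    control_bits = sets * (lru_bits + ways * sector_bits)
    # Each offset has `depth` data slots per set instead of `ways`.
    data_bytes = sets * subsectors * depth * (sector_bytes // subsectors)
    return data_bytes + control_bits / 8


normal = total_bytes(32 * 1024, 128, subsectors=1, depth=8)
pool = total_bytes(32 * 1024, 64, subsectors=2, depth=5)
print(round(normal / 1024, 2), round(pool / 1024, 2))  # 33.23 22.83
print(round(100 * (1 - pool / normal), 1))             # 31.3
```

Note that the data space of the pool configuration is 5/8 of the nominal 32K, which is where most of the saving comes from; the pointer bits add only a modest amount to the control storage.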
Sector and Subsector Eviction
Each sector maintains pointers to its valid subsectors, so it is relatively easy to determine which subsectors in a set need to be evicted when a sector is replaced. The most similar previous work using a dynamic subsector mapping ([Sez94, Sez97]) maintained pointers from the subsectors to the corresponding sectors, which can make sector eviction slow and complex. On sector eviction it would be necessary for that scheme to search a significant fraction of the cache to determine which subsectors point to the sector being evicted. This search involves reading the status bits from all subsectors which could possibly belong to the sector being evicted, requiring a significant number of cache access cycles to determine which subsectors should also be evicted. The sector pool design associates the placement information with the sector, allowing a much faster determination of the group of subsectors to be evicted when a sector is replaced.

The miss ratios for a large variety of sector pool cache organizations were computed using our workload. Table 3 shows the miss ratios for a unified (combined instruction and data) cache; corresponding tables for the miss ratios of instruction and data caches can be found in [RS99b]. Figure 6 shows a graphical representation of Table 3 for 64K caches, with a comparison to normal 64K caches. The parameters varied were nominal cache size (4K to 512K bytes), sector size (16 to 512 bytes), number of subsectors per sector (1 to 8), the set-associativity (2 to 8), and the number of possible depths (1 to 8). The bus width was set to 8 bytes, which sets the lower bound on the subsector size. We use the term nominal cache size to indicate that there are enough address tags to manage a cache of that size were it a traditional sector cache; for sector pool caches with depth less than the set-associativity, the data space actually provided is smaller.
Single Level Caches
An initial comparison of the miss ratio and additional memory access delay looks promising compared to similar sector cache designs (sector pool cache with depth 8). For example, Figures 6 and 7 demonstrate that for caches with a sectoring degree of 4 or 8, decreasing the pool depth by up to 3 has little influence on performance (corresponding graphs for instruction and data caches can be found in [RS99b]). Particularly in Figure 7, the bars showing depth 5 through 8 are almost indistinguishable (e.g., 128 byte sectors with 32 byte subsectors). However, an important point to note is that the normal caches (leftmost set of bars in Figures 6 and 7) include at least one bar that outperforms all the sector caches. For any particular cache size, and ignoring the overhead resources (tag bits) required to implement it, a normal cache will outperform a sector cache due to the flexibility of mapping cache lines into the cache. A more proper comparison, however, would consider tag bits as well as data bits, and would consider memory access time, not just miss ratio.
The index of performance used in this paper to compare various cache organizations is additional delay, which does not take the intrinsic cache access delay of 1.0 into account. Additional delay (measured relative to processor cycle time) was determined for the single level cache using the formula: additional delay = mr * (overhead + 5 * subsector/buswidth), where overhead is either 5 or 15 cycles, buswidth is 8 bytes, "subsector" is the subsector size, and "mr" is the miss ratio. We choose 5 or 15 cycles by assuming a system using a 500 MHz CPU with a 100 MHz 8 byte wide memory bus, using SDRAM with an X-1-1-1 response time, where X is 2 or 4 for first level caches and 5 for two-level caches. Note (see [Smi87]) that the minimum delay as we vary various parameters is dependent on the ratio of the overhead latency to the bus width transfer rate, and not on either independently. For example, a 1000 MHz CPU using a 100 MHz motherboard would have double the observed miss induced memory delay (in processor cycles) of the delay values presented here, but the optimal configurations determined here would remain the same.

Table 3: Unified miss ratios; real cache capacity is (depth/associativity) * Cache Size.

Sector pool caches provide results that range from comparable to substantially better than the normal cache designs. For example, in Figure 8 with overhead 15, a normal 32K byte cache with 128 byte blocks has an additional delay of 0.5685 and requires 33.23K bytes to implement. A sector pool cache with 32K nominal size with 64 byte sectors and 32 byte subsectors with depth 5 has an additional delay of 0.5670 and requires only 22.83K bytes to implement, which represents a saving of 31.3 percent in overall bits. Note, however, that a normal cache with a 64- or 32-byte block yields still better performance and requires only a few more bits to implement than the 128-byte block example. The performance of the best sector pool caches follows the trend of the normal caches, deviating by only a few percent in the best cases.
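The single-level additional-delay formula can be written out as a small Python sketch. The function names are hypothetical. The helper `miss_overhead_cycles` encodes one interpretation of the X-1-1-1 timing, namely that the overhead is the extra bus cycles of the first transfer converted to processor cycles; this is an assumption, though it does yield the 5 and 15 cycle overheads quoted for X = 2 and X = 4. The miss ratio in the usage lines is an arbitrary illustrative value, not a result from this paper.

```python
def miss_overhead_cycles(x_bus_cycles, cpu_mhz=500, bus_mhz=100):
    """Processor cycles of overhead for an X-1-1-1 memory (assumed reading:
    the first transfer costs X bus cycles, one of which is the transfer)."""
    cpu_per_bus = cpu_mhz // bus_mhz
    return cpu_per_bus * (x_bus_cycles - 1)


def additional_delay(miss_ratio, overhead, subsector_bytes,
                     bus_width=8, cycles_per_transfer=5):
    """additional delay = mr * (overhead + 5 * subsector / buswidth)."""
    transfers = subsector_bytes / bus_width
    return miss_ratio * (overhead + cycles_per_transfer * transfers)


# Illustrative only: a hypothetical 2% miss ratio with 32-byte subsectors.
for x in (2, 4):
    overhead = miss_overhead_cycles(x)
    print(x, overhead, additional_delay(0.02, overhead, 32))
```

Because the per-miss cost grows linearly with the subsector size while the miss ratio usually falls as the subsector grows, the minimum of this product depends on the ratio of overhead to transfer time, as noted in the text.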
The chief advantage of single level sector pool caches is that the flexibility of the organization allows any arbitrary size of cache to be implemented and is less restricted in the size of the data space. As can be seen, sector caches are rarely better than normal caches for single level caches, which was also noted in [RS99a].
Two-Level Cache Organizations
Two-level caches have become increasingly common, because they often yield higher performance. A small cache is often significantly faster than a larger one, and thus the cost of more misses from the smaller first level cache is more than compensated for by the shorter access time, which itself often translates directly to a higher clock frequency [JW94]. A desirable configuration for a two-level cache is for the first level to hold both the first level cache itself and the tags for the second level cache. In this case, a second level sector cache has the advantage that the limited number of bits available on the first level can control a very large set of data arrays on the second level. We consider that design in this section.
To evaluate two-level cache organizations, we assume that the processor die contains the entire first level cache and the tags and the related control circuitry for the second level cache. The time for a hit in the second level cache is two cycles plus one cycle for each eight bytes transferred from the L2 cache to the L1 cache. The formula for measuring the delay time is: additional memory delay = mr1 * (2 + subsector1/8) + mr2 * (20 + 5 * subsector2/8).
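The two-level formula can be sketched in the same way. The function name `two_level_delay` is hypothetical, and the sketch assumes that `mr2` is measured against all processor references (as would be the case when single level miss ratios are reused for the second level), so that the two terms can simply be added. The miss ratios in the usage lines are illustrative values, not results from this paper.

```python
def two_level_delay(mr1, subsector1, mr2, subsector2, bus_width=8):
    """mr1 * (2 + subsector1/8) + mr2 * (20 + 5 * subsector2/8).

    mr1, mr2:    miss ratios of the first and second level caches
    subsector1:  L1 fetch size in bytes (moved from L2 to L1 at 1 cycle/8 bytes)
    subsector2:  L2 fetch size in bytes (moved from memory at 5 cycles/8 bytes)
    """
    l2_hit_cost = 2 + subsector1 / bus_width
    memory_cost = 20 + 5 * subsector2 / bus_width
    return mr1 * l2_hit_cost + mr2 * memory_cost


# Hypothetical miss ratios: halving mr2 at a fixed L1 configuration.
print(two_level_delay(0.05, 32, 0.010, 64))
print(two_level_delay(0.05, 32, 0.005, 64))
```

The second term dominates whenever the L2 miss ratio is not small, which is why spending on-chip bits on L2 tags to control a larger second level cache can pay off.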
For the two-level experiments presented here, we examined first level caches from 4K to 256K bytes. The second level cache varied from 8K to 512K bytes. The sector sizes were varied from 16 to 512 bytes and the subsector sizes between 8 and 512 bytes. The bus width between the CPU, the caches, and main memory was set to 8 bytes. Both L1 and L2 caches were eight-way set-associative, and all possible combinations of normal, sector, and sector pool organizations for first and second level caches were examined to find the minimum miss induced memory delay for a given upper limit of on-chip cache bits.
Due to the computationally intensive nature of multi-level cache simulations, the miss ratios of the single level simulations were used to generate the results in this section. This may violate the inclusion principle of multilevel memory hierarchies when the ratio of the sector sizes from L1 and L2 caches is large [BW88]. This could possibly cause an underreporting in the L2 miss ratio, but we do not believe this will have a significant impact upon our results.
Sector pool caches are never an optimal choice for an L2 cache, because they require more tag/control bits than normal sector caches due to the necessary pointer bits associated with each sector. Using a sector pool cache as the first level cache and a normal sector cache as the second level cache in a two-level cache organization generally shows the best performance. When the second level tag and control bits are on-chip, a level two sector cache allows the possibility of controlling the most cache with the smallest number of bits. The combination of an L1 sector pool cache with an L2 sector cache with on-chip tags provides significantly better performance (in terms of memory access delay) than a normal cache, when the gross on-chip cache size is limited.
The results presented here use 8-way set-associative sector pool caches exclusively, due to the large degree of freedom in picking the pool depth. Results for the 2-way set-associative sector pool cache, which is the most similar to one of the organizations described in [Sez94, Sez97], show little promise except for the smallest cache sizes (around 4K); the only alternative to the fully populated version of the 2-way pool is depth 1, which is direct-mapped, and conflict misses become a major issue. Results for a 4-way set-associative sector pool cache show some improvement over normal and sector caches up to about 16K of on-chip space and little advantage above that, again because conflict misses are an issue.

Table 4 shows an excerpt of the best configurations for a given gross on-chip cache size, which includes the whole level one cache and the tags for an off-chip level two cache. For the unified, instruction and data address streams, three different organizations (normal, sector, and sector pool) are presented. A fully off-chip 512K 64-byte block normal level two cache is also considered for each configuration. This cache is two cycles slower to access than a level two cache with the tags on chip and is direct-mapped, which is useful for reducing off-chip data multiplexing complexity. The direct-mapped miss ratios were computed using the relative miss ratio changes for 8-way to 1-way set-associative caches in [HS89], which increases the miss ratio by about 60 percent.
The results show that sector pool organizations for a first level cache outperform normal caches, with a large difference in memory delay for small cache sizes. However, once there are enough on-chip bits to control the largest off-chip cache size simulated, performance for the various organizations begins to converge. For the largest on-chip bit budget allowed, the best organizations use only normal caches. In some cases (unified and data caches), the extra delay of an L2 cache only twice the size of the L1 cache results in certain configurations that are better off without an L2 cache. We should point out that sector pool caches are useful only when there is a payoff to saving the cache space and reallocating it to other purposes. Once the miss ratio has leveled off (as for the largest caches we simulate), sector pool caches are no longer advantageous. For some workloads, however, such as databases, transaction processing, and operating-system-intensive workloads, the miss ratios would be higher and would continue to drop for much larger cache sizes. In such an environment, it could be expected that sector pool caches would retain their advantages for very large cache sizes.
An example of the trade-offs between sector, normal, and sector pool caches can be seen in part of one row in Table 4, which shows the three organizational choices for a data cache with a maximum allowed 12.92KB gross on-chip cache size. The sector pool cache utilizes 25 percent of its potential data space as level two tag bits to enable fast off-chip cache management.
The slight increase in the first level miss ratio is compensated by a lower miss penalty by using a faster responding second level cache unit. This trade-off decreases delay induced by misses by about 8.0 percent over the sector cache organization and 15.5 percent over the normal cache organization.
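The shape of this trade-off can be illustrated with a simple two-level delay model. The sketch below is not the simulator used for Table 4; the function name `miss_delay_per_ref` and every latency and miss ratio in it are hypothetical placeholders, chosen only to show how a slightly higher first level miss ratio can be outweighed by a second level cache that responds two cycles sooner.

```python
def miss_delay_per_ref(l1_miss, l2_latency, l2_local_miss, mem_latency):
    """Average cycles lost to misses per reference in a two-level hierarchy."""
    return l1_miss * (l2_latency + l2_local_miss * mem_latency)

# Hypothetical numbers, for illustration only (not taken from Table 4).
# Option A: all on-chip bits go to L1; L2 tags are off-chip, so L2 is 2 cycles slower.
slow_l2 = miss_delay_per_ref(l1_miss=0.050, l2_latency=8,
                             l2_local_miss=0.20, mem_latency=40)
# Option B: a sector pool L1 gives up some data space to hold the L2 tags on-chip.
fast_l2 = miss_delay_per_ref(l1_miss=0.052, l2_latency=6,
                             l2_local_miss=0.20, mem_latency=40)

print(f"off-chip tags: {slow_l2:.3f} cycles/ref, on-chip tags: {fast_l2:.3f} cycles/ref")
```

With these assumed values, the configuration with on-chip L2 tags loses fewer cycles per reference even though its L1 miss ratio is higher.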
Given the choice between using an inexpensive and slightly slower off-chip cache and devoting on-chip bits to address tags that control a costlier and faster off-chip cache, the more expensive (in our cost model) off-chip cache wins in most of the choices in Table 4. This shows that it is useful to conserve on-chip bits by using a sector pool cache; the bits saved can then be applied to speeding up the L2 cache.
An interesting feature of Table 4 is the pattern by which level two cache is added as the on-chip resources increase.
Given a choice between increasing the L2 cache size while keeping large (and inefficient) subsector sizes, or decreasing the subsector size with constant L2 cache size, increasing the cache size is the preferred option.
Both options require roughly the same number of tag bits, as both require the same number of address tags. By using a sector cache organization for the L2 cache, more efficient subsector sizes can be utilized while keeping the total number of tags constant. Sector caches are thus a powerful method of adding second level cache with reasonable L2 fetch sizes when the number of on-chip bits is constrained.
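The claim that both options need roughly the same tag storage can be checked with a short calculation. The sketch below is an illustration under stated assumptions: 32-bit addresses, a 4-way set-associative L2, power-of-two sizes, and one valid and one dirty bit per subsector. The function name `l2_tag_budget` and the specific cache sizes are hypothetical and are not configurations from Table 4.

```python
def l2_tag_budget(cache_bytes, sector_bytes, subsector_bytes, assoc, addr_bits=32):
    """Return (address tags, tag width in bits, valid+dirty bits) for a sector cache.

    Assumes all sizes are powers of two.
    """
    sectors = cache_bytes // sector_bytes              # one address tag per sector frame
    sets = sectors // assoc
    index_bits = sets.bit_length() - 1                 # log2 of a power of two
    offset_bits = sector_bytes.bit_length() - 1
    tag_width = addr_bits - index_bits - offset_bits
    status_bits = 2 * (cache_bytes // subsector_bytes) # one V and one D bit per subsector
    return sectors, tag_width, status_bits

KB = 1024
base = l2_tag_budget(256 * KB, sector_bytes=256, subsector_bytes=64, assoc=4)
# Option 1: double the L2 size, keeping the subsector size and the number of tags.
bigger = l2_tag_budget(512 * KB, sector_bytes=512, subsector_bytes=64, assoc=4)
# Option 2: keep the L2 size and halve the subsector size.
finer = l2_tag_budget(256 * KB, sector_bytes=256, subsector_bytes=32, assoc=4)

for name, cfg in (("base", base), ("bigger", bigger), ("finer", finer)):
    print(name, cfg)
```

Under these assumptions both options keep the same number of address tags and the same number of valid and dirty bits; the tag width differs by one bit because the larger sector consumes one more offset bit.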
Summary of Results
From the results presented here, it is possible to see that the sector pool cache provides small but, in some cases, significant improvements in performance for single level caches. The best use of sector pool caches, however, is as the first level cache in a two-level cache organization with the second level cache tags on-chip.
Particularly when the number of on-chip bits is small, sector pool caches allow more flexible organizations that can trade L1 cache data bits for L2 cache tag/control bits to more precisely manage larger off-chip caches. Sector pool caches are not a good choice for the off-chip cache; regular sector caches perform the best in that role over most of the range we examined.
We note that to a large extent, it is possible to deduce our simulation results (of course, had we simply deduced them, readers would rightly have insisted on simulations to confirm our insights).
Sector caches trade mapping flexibility for a savings in tag bits. This tradeoff is worthwhile when tag bits are a significant component of the storage bits. Tag bits are significant when the line sizes are small. Small line sizes are best when transmission time for a miss is much higher than latency and when cache sizes are small [Smi87]. Sector caches are therefore most useful when caches are small and miss latencies are small. Sector pool caches are most useful, as explained above, when the savings in data array area can be put to good use; that occurs most conspicuously when a first level sector pool cache shares space with tags for a second level sector cache.
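The dependence of tag overhead on line size can be made concrete with a small calculation. This is a rough sketch rather than the cost model used in our simulations: it assumes 32-bit addresses and a normal (non-sector) cache with power-of-two sizes, it counts only address tag bits (valid, dirty, and replacement bits are ignored), and the function name `tag_overhead` and the 8K, 2-way example are hypothetical.

```python
def tag_overhead(cache_bytes, line_bytes, assoc, addr_bits=32):
    """Fraction of storage bits spent on address tags in a normal cache."""
    lines = cache_bytes // line_bytes
    sets = lines // assoc
    index_bits = sets.bit_length() - 1        # log2 of a power of two
    offset_bits = line_bytes.bit_length() - 1
    tag_bits = lines * (addr_bits - index_bits - offset_bits)
    data_bits = cache_bytes * 8
    return tag_bits / (tag_bits + data_bits)

# Tag overhead for an assumed 8K, 2-way cache at several line sizes.
for line in (8, 16, 32, 64):
    print(line, round(tag_overhead(8 * 1024, line, assoc=2), 3))
```

The tag fraction shrinks as the line grows, which is why sharing one tag among several subsectors pays off mainly when the preferred transfer unit is small.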
With the perpetual increase in transistor and gate count, even first level caches are becoming relatively large, and for such large first level caches, sector pool caches are of limited use. There is a long history, however, of proposed designs in which multiple CPUs were located on each chip, for example the machine built and sold by Thinking Machines Corporation.
Such designs have worked poorly in the past because the best use for the available gates or transistors was not in creating multiple CPUs, but in making the single CPU more powerful, and giving it an on-chip cache memory (or a larger cache). With transistor counts passing the 10 (or 100) million mark, this design choice needs to be reassessed. At some point, which is likely to be reached soon, it may be preferable for a single chip to hold two or more CPUs, each with a reasonably sized cache. In that case, a design with a sector pool cache (tags at the first level, data array at the second level) should be very attractive; such a design could have a shared second level cache either on-chip or off-chip.
Conclusions
We have presented and evaluated a new cache design called the sector pool cache. Regular sector caches associate several transfer units (subsectors) with each address tag (sector frame) in the cache. Studies have shown that many of the subsectors are not accessed before the corresponding sector is evicted from the cache. Sector pool caches attempt to utilize this potentially wasted space by sharing a pool of subsectors among all the sectors in a set. Each set maintains a list of subsectors for each possible subsector position in a sector frame. Associated with the sector frames are pointers into these lists. When the number of subsectors in each list is smaller than the set-associativity of the sets, less data space is used than in a normal cache, but with only a small impact on the miss ratio and particularly on the observed memory delay.
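The organization of a single set can be sketched as a small data structure. The model below is illustrative only and is not the simulator used in this study: the class name `SectorPoolSet`, the LRU choice of sector frame to replace, and the policy of stealing the least recently used slot in a full list are all assumptions made for the example.

```python
class SectorPoolSet:
    """One set of a sector pool cache (illustrative model with assumed LRU policies)."""

    def __init__(self, assoc, subsectors_per_sector, pool_depth):
        self.tags = [None] * assoc                      # one address tag per sector frame
        # ptr[frame][pos]: slot index in list `pos`, or None if that subsector is absent
        self.ptr = [[None] * subsectors_per_sector for _ in range(assoc)]
        # owner[pos][slot]: frame currently using that data slot, or None if free
        self.owner = [[None] * pool_depth for _ in range(subsectors_per_sector)]
        self.frame_lru = list(range(assoc))             # least recently used first
        self.slot_lru = [list(range(pool_depth)) for _ in range(subsectors_per_sector)]

    @staticmethod
    def _touch(order, item):
        order.remove(item)
        order.append(item)                              # most recently used goes last

    def _allocate(self, pos):
        """Pick a free slot in list `pos`, else steal the LRU slot from its owner."""
        for slot, who in enumerate(self.owner[pos]):
            if who is None:
                return slot
        slot = self.slot_lru[pos][0]
        self.ptr[self.owner[pos][slot]][pos] = None     # previous owner loses the subsector
        return slot

    def access(self, tag, pos):
        """Reference subsector `pos` of sector `tag`; return the outcome as a string."""
        if tag in self.tags:
            frame = self.tags.index(tag)
            result = "hit" if self.ptr[frame][pos] is not None else "subsector_miss"
        else:
            frame = self.frame_lru[0]                   # replace the LRU sector frame
            for p, slot in enumerate(self.ptr[frame]):  # release its pooled subsectors
                if slot is not None:
                    self.owner[p][slot] = None
            self.ptr[frame] = [None] * len(self.ptr[frame])
            self.tags[frame] = tag
            result = "sector_miss"
        if self.ptr[frame][pos] is None:
            slot = self._allocate(pos)
            self.owner[pos][slot] = frame
            self.ptr[frame][pos] = slot
        self._touch(self.slot_lru[pos], self.ptr[frame][pos])
        self._touch(self.frame_lru, frame)
        return result


# A 4-way set with 2 subsector positions and only 2 data slots per position.
demo = SectorPoolSet(assoc=4, subsectors_per_sector=2, pool_depth=2)
trace = [("A", 0), ("A", 0), ("B", 0), ("C", 0), ("A", 0)]
print([demo.access(tag, pos) for tag, pos in trace])
```

In this example the pool depth (2) is smaller than the associativity (4), so a third sector touching position 0 takes a data slot from an older sector whose address tag nevertheless stays resident; the final reference to A therefore misses on the subsector rather than on the sector.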
Sector caches require inflexibility in subsector allocation in return for significant savings in tag space relative to normal caches. As was shown previously in [RS99a], sector caches are therefore useful in a limited number of cases, those in which the savings in tag space can be devoted to reducing the number of misses or the miss penalty (such as by adding more data space or more efficiently controlling large off-chip caches), sufficient to offset the increase in miss ratio due to the constrained mapping. Sector pool caches save even more cache space, and therefore are useful in a larger number of cases. Sector pool caches generally have the best performance for almost any specified number of bits; however, since there are many more potential configurations of sector pool caches, this should not be a great surprise. The best performing sector pool caches follow the performance trend of normal caches for single level on-chip caches.
Sector pool caches are most useful when used as the first level cache for two-level cache architectures. In a system that contains the first level cache and the tags for the second level cache on the same die, the versatile combination of a sector pool design as the level one cache and a regular sector cache design for level two outperforms normal or plain sector cache configurations, in many cases by large amounts. This combination is particularly useful for small on-chip caches, as it allows control of a large second level cache by trading off some of the data space of the first level cache and using those bits for second level cache address tags.
Our test methodology used a trace-driven simulated multiprogrammed workload to strenuously evaluate the variety of cache designs. Previous research [GHPS93] has shown that many of the workloads traditionally used as benchmarks to evaluate caches tend to have low and unrepresentative miss ratios that do not put much stress on the memory subsystem. Our simulated workload accesses many more unique memory locations than the bytes available in the caches under test, which results in a fairer and more realistic evaluation of the various configurations.
