Introduction
Cache memories are high speed buffers used to temporarily hold, for repeated access, portions of the contents of a larger and slower memory. Most modern caches are organized as a set of entries. Each entry consists of a block (or line) of data, and a tag; the tag is the location of the data in the larger (main) memory. The cache is accessed associatively: the key to the cache is not the location in the cache but the location in the main memory. To accelerate access, of course, the cache may not be fully-associative, but rather may be set-associative, direct mapped or hashed.
The first commercial cache, in the IBM 360/85 [12], used a slightly more complex design, called a sector cache. A sector cache is organized as a set of sectors; associated with each sector is an address tag. The sector itself is divided into subsectors. Each subsector has a valid bit, and only some of the subsectors of a sector may be present. When there is a miss to a sector, a resident sector is evicted, an address tag is set to the address of the new sector, and a single subsector is fetched. When there is a miss to a subsector when the sector containing it is already present in the cache, only that needed subsector is fetched. This first cache, in the IBM 360/85, was 16 Kbytes, and consisted of 16 sectors of 16 64-byte blocks. Note that we use the terminology "sector" and "subsector." In [12], these are known respectively as "sector" and "block." In [5], these are called "address block" and "transfer block." In [7] these are called "block" and "sub-block."
The original reason for the sector cache was technological; the discrete transistor logic of the time made a sector design easier to build than the currently more common non-sectored design. Unfortunately, the performance of the sector design in the 370/168 was inferior to the nonsectored design, as was shown in [7], and sector cache designs largely disappeared; one other machine with a sector cache was the Zilog Z8000. The problem with the sector design was that a sector would typically be evicted from the cache before all of its subsectors were loaded, and thus a large fraction of the cache capacity would be unused.
Sector caches do have, however, one important advantage. In a normal (non-sectored) cache, the only way to have a very large cache capacity with a relatively small number of tag bits is to make the cache blocks (lines) very large; the problem in that case is that every miss requires that a large block be fetched in its entirety. With a sector cache, it is possible to fetch only a portion of a block (or sector), and thus the time for a miss, and the bus traffic, can both be significantly reduced. Thus although it is likely that sector caches will have higher miss ratios than normal caches, there is the possibility that when timing is considered, the sector cache will be found to have better performance.
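The sector cache behavior described above can be illustrated with a minimal sketch. It assumes a fully-associative cache with LRU replacement of whole sectors and demand fetching of single subsectors; the class name `SectorCache` and its counters are hypothetical names chosen for illustration, not taken from any simulator used in this paper.

```python
from collections import OrderedDict


class SectorCache:
    """Illustrative fully-associative sector cache with LRU sector replacement."""

    def __init__(self, cache_bytes, sector_bytes, subsector_bytes):
        assert sector_bytes % subsector_bytes == 0
        self.sector_bytes = sector_bytes
        self.subsector_bytes = subsector_bytes
        self.num_frames = cache_bytes // sector_bytes
        # Maps sector address tag -> set of valid subsector indices.
        # Iteration order runs from least to most recently used.
        self.frames = OrderedDict()
        self.misses = 0
        self.bytes_fetched = 0

    def access(self, address):
        """Reference one byte address; return True on a hit."""
        tag = address // self.sector_bytes
        sub = (address % self.sector_bytes) // self.subsector_bytes
        valid = self.frames.get(tag)
        if valid is None:
            # Sector miss: evict the LRU sector if the cache is full,
            # then allocate a frame with no valid subsectors.
            if len(self.frames) >= self.num_frames:
                self.frames.popitem(last=False)
            valid = set()
            self.frames[tag] = valid
        else:
            self.frames.move_to_end(tag)
        if sub in valid:
            return True
        # Subsector miss: fetch only the subsector that was referenced.
        valid.add(sub)
        self.misses += 1
        self.bytes_fetched += self.subsector_bytes
        return False
```

Setting the subsector size equal to the sector size gives a normal (non-sectored) cache. For the reference string 0, 4, 16, 64, 128, 0 and a 128-byte cache, 64-byte sectors with 16-byte subsectors give 5 misses and 80 bytes fetched, while 64-byte non-sectored blocks give 4 misses and 256 bytes fetched: a higher miss count but less traffic, which is the trade-off discussed above.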
There have been a few previous studies of sector caches. Early studies include [5, 7, 6]. [14] provided a more extensive and thorough study, and found that in some cases sector caches outperformed standard caches.
In this paper, we provide a more thorough and extensive study of uniprocessor sector cache designs than has previously appeared. We start by creating a standard workload and then we calculate (using trace driven simulation) miss ratios for a wide range of sector cache designs. Those miss ratios are transformed into Design Target Miss Ratios, which are intended to reflect an average, representative workload [23, 24]. The miss ratios are then used to estimate performance, using typical timings, for a variety of one-level and two-level cache designs.
As we will show below in detail, we find that for single-level caches, sector caches are seldom advantageous. For multi-level cache designs with small amounts of storage at the first-level caches, as would be the case for small on-chip caches, sector caches can yield significant performance improvements. For multi-level designs with large amounts of first-level storage, sector caches can provide small improvements.
Methodology
Our main concerns in evaluating sector caches were (1) finding a diverse group of workloads to fully test the caches; (2) capturing a large chunk of the full execution of each workload to provide a diversity of program behaviors; and (3) using large enough traces to cause the majority of misses to be capacity misses rather than cold-start misses. We addressed these issues by using a wide variety of workloads combined together into a single multiprogrammed trace.
The results for this paper were generated using trace driven simulation (TDS) [25]. All but two of the traces were generated on a MIPS R3000 based workstation using two trace generation tools (Cerberus [16] and Pixie [26]). Approximately 100 million instructions (with accompanying data) were collected from each program that was not run to completion, to create a full range of events for simulation. It was noted early on in the simulation process that the range of instruction and data space accessed by the individual traces generally was not big enough to fully exercise (and thus get meaningful results from) caches larger than 64K bytes (see [17]). This problem was also noted in [8] for short traces. The solution we developed was to simulate a multiprogrammed environment, described in Section 2.2.
Workloads
A diverse group of 19 programs was chosen to drive the sector cache simulations. The characteristics of the programs can be found in Table 1. The workloads come from five basic categories: SPEC95 Integer, SPEC95 Floating Point, UNIX programming languages (also in SPEC95, but we traced the versions already available on our workstation), multimedia applications (ijpeg is also from SPEC95), and two IBM/370 traces including user and supervisor memory references. The SPEC95 integer and floating point programs are available from [27].
In [17], we show the data and instruction miss ratios for all of the individual traces run straight through without multiprogramming, and the multiprogrammed trace, using 16-byte cache blocks; as is explained below, the traces were combined into a single multiprogrammed trace for our studies.
As noted in [23], examining user space references exclusively results in over-optimistic estimates of performance for two main reasons: (1) task switching causes all or part of the cache to be flushed, making the performance of the memory system worse than would be predicted from a simple workload evaluation; (2) OS code is typically large, not very compact, and has frequent branches. In fact, it is well known that commercial workloads and operating systems workloads have miss ratios that are much higher than are observed for the types of programs typically studied by trace driven simulation [9] (also see the discussion in [4] of this issue). For this reason we have simulated a multiprogramming workload using actually observed task switching patterns [10], as discussed in the next section, and have used the MVS trace to add some OS references into the cache simulation.
Multiprogrammed Traces
In order to obtain a trace which displayed miss ratio characteristics closer to those observed in practice, and also in order to ensure that the address space referenced was significantly larger than the largest cache to be studied, we combined all of the individual traces into a single multiprogrammed trace. This approach also yielded an "average" over the constituent traces, without having to explicitly compute an average, and more closely resembles the workload on a real machine, which is typically multiprogrammed.
A published study of task switching behavior [10] was used as the basis for our multiprogramming model. In that paper, it was shown that the LRU stack model could be used to represent the sequence of active tasks. That is, let all of the tasks in the multiprogramming set be placed in an LRU stack. When it is time to select another process to run, the i'th task in the LRU stack is selected with probability p(i).
We approximated the stack distance probability distribution p(i) from [10] by the following equation:
$$p(x) = 0.05\,e^{-0.315x} + 0.85\,x^{-[\ldots]} \qquad (1)$$
where x is the MRU distance from the top of the stack ($x \in [1..n]$), and 1 denotes the most recently used task.
To generate our multiprogrammed trace, each of the original traces was treated as a separate task. The tasks were organized into an LRU stack. The initial ordering of the task stack was arbitrary but consistent for all simulations. A task was chosen using the stack probability function (Equation 1), run for a time quantum, and put back on the top of the stack. The quanta had an exponential distribution with mean 1,000,000 memory references. Each trace was considered to reference a separate virtual address space, and each block in the cache was tagged with a 6-bit address space ID; thus the cache was not purged when the active task was switched. This process was repeated until all the traces were exhausted.
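The trace construction procedure above can be sketched as follows. This is a minimal illustration, not the tool used for the study: the function names `select_task` and `interleave` are hypothetical, the stack distance function `p` is passed in by the caller rather than hard-coded, and rounding each quantum to at least one reference is an assumption made so that the loop always makes progress.

```python
import random

_EXHAUSTED = object()  # sentinel marking the end of a trace


def select_task(stack, p, rng):
    """Pick a task by LRU stack distance and move it to the top.

    stack[0] is the most recently used task; p(i) is the (unnormalized)
    weight of stack distance i, counted from 1.
    """
    weights = [p(i) for i in range(1, len(stack) + 1)]
    index = rng.choices(range(len(stack)), weights=weights, k=1)[0]
    task = stack.pop(index)
    stack.insert(0, task)
    return task


def interleave(traces, p, mean_quantum, seed=0):
    """Yield (task_id, reference) pairs until every trace is exhausted.

    traces maps a task id to an iterable of memory references. The task
    id plays the role of the address space ID attached to each reference.
    """
    rng = random.Random(seed)
    iterators = {tid: iter(refs) for tid, refs in traces.items()}
    stack = list(iterators)  # arbitrary but fixed initial order
    while stack:
        task = select_task(stack, p, rng)
        # Exponentially distributed quantum, at least one reference long.
        quantum = max(1, round(rng.expovariate(1.0 / mean_quantum)))
        for _ in range(quantum):
            ref = next(iterators[task], _EXHAUSTED)
            if ref is _EXHAUSTED:
                stack.remove(task)  # trace finished: retire the task
                break
            yield task, ref
```

A call such as `interleave(traces, p, mean_quantum=1_000_000)` corresponds to the mean quantum stated above, with `p` supplied as the approximation of Equation 1.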
Simulated Timing
Some of our simulations were used to generate only miss ratios, and some incorporated timing in order to estimate performance. All times were in numbers of processor cycles, rather than in absolute real times. For the timing simulations, cache hits always returned results in a single cycle; the delay caused by cache misses depended on the particular timing being simulated. We considered two different main memory latency times (6 and 15 cycles) which would be appropriate for single-level caches. For two-level caches we used a longer main memory access delay (24 cycles), with slower per word transfer times. See [17] for a more detailed discussion of the timing issues.
Cache Parameters
All of the simulations conducted here used a fully-associative cache. Although in real caches associativities of 1 to 4 are most common [22], the use of a fully-associative design allows us to study the design of sector caches separately from the effects of limited associativity and conflict misses. As shown in [21, 8], the effects of limited associativity can be calculated from the fully-associative miss ratio function. In any case, the miss ratios for 4-way and 8-way associativity are very close to those for fully-associative designs. The simulations used a true least recently used (LRU) replacement policy with demand fetching of data from memory, and allocation of sectors in the cache for write misses (write allocate).
Besides the usual cache design parameters of cache size and block size, we also study a factor called degree of sectoring for subdividing a block into smaller fetch units. The unit associated with the tag (the sector) is varied between 16 and 512 bytes. The sub-unit used for fetching (subsector) ranges between 4 and min(sector size, 512) bytes. The range of cache sizes under test was 4K bytes to 512K bytes.
Tag Bits
The number of bits used to control the cache depends on a number of factors. These factors include the cache size, the number of sector frames (places to put a sector) in the cache, the set-associativity, bits for validity and dirtiness of the subsectors, and the number of bits in the address. Instruction caches only need to determine if a subsector is valid; data and unified caches need an additional bit to determine if the subsector has been written, requiring a writeback when the sector is evicted from the cache. For calculations involving the number of tag bits, we use the number of bits required to maintain the LRU replacement policy for an 8-way set-associative cache (although the simulation results were generated by a fully-associative cache simulator). The rest of the bits are those bits necessary to uniquely identify each sector (address tag bits) and to maintain status (valid and dirty bits). We refer to the combination of all these types of bits as the tag bits. See [17] for a more detailed discussion of the computation of the number of tag bits needed in each case.
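To make the accounting above concrete, the following is a rough sketch of one way to count tag bits. It is a simplified reading, not the computation of [17]: the function name `tag_bits` is hypothetical, all sizes are assumed to be powers of two, LRU state is charged as the minimum number of bits that can encode an ordering of the ways in a set, and address space ID bits are ignored. The totals it produces are therefore not expected to match the figures quoted elsewhere in this paper.

```python
import math


def tag_bits(cache_bytes, sector_bytes, subsector_bytes,
             address_bits=48, ways=8, dirty_bits=True):
    """Estimate the control bits for a set-associative sector cache.

    Assumes power-of-two sizes. dirty_bits=False models an instruction
    cache, which needs only a valid bit per subsector.
    """
    frames = cache_bytes // sector_bytes          # sector frames in the cache
    sets = frames // ways
    subsectors = sector_bytes // subsector_bytes  # subsectors per sector
    index_bits = sets.bit_length() - 1            # log2(sets)
    offset_bits = sector_bytes.bit_length() - 1   # log2(sector size)
    address_tag = address_bits - index_bits - offset_bits
    status = subsectors * (2 if dirty_bits else 1)
    # Assumed LRU cost: enough bits to encode any ordering of the ways.
    lru_per_set = math.ceil(math.log2(math.factorial(ways)))
    return frames * (address_tag + status) + sets * lru_per_set
```

Under these assumptions, a 512K-byte instruction cache with 64-byte blocks needs `tag_bits(512 * 1024, 64, 64, dirty_bits=False)` = 286,720 bits, while 512-byte sectors with 64-byte subsectors need 43,008 bits, since the address tag is stored once per sector rather than once per block.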
Simulation Results

Sector Cache Miss Ratios
One of the most widely used measures of cache memory performance, the miss ratio, is the ratio of the number of references that are not satisfied by the cache to the total number of references. Unlike memory access delay (discussed in Section 4.1), the miss ratio provides an implementation-independent measure of cache performance. From the miss ratio a number of other useful statistics can be calculated, such as the fetch traffic (average bytes fetched per memory reference), traffic ratio and the average memory access delay. The traffic ratio is the ratio of traffic between the processor and main memory with and without a cache, which takes fetch size and eviction traffic into account. Memory access delay is the effective access time per memory reference, which includes such factors as the miss ratio, fetch size, memory latency and memory bandwidth.
Table 3 shows samples of the miss ratios for unified, instruction and data caches determined by our cache simulations; the unified miss ratios are also displayed in Figure 2. The general feature to notice about the miss ratios is that for a fixed subsector size, increasing the sector size makes the miss ratio worse, due to inflexibility in being able to map the subsectors into the cache. The best miss ratio for a given subsector size occurs when it is equal to the sector size (as in non-sector caches). We ran several regressions to attempt to predict the miss ratio from the input parameters of cache, sector, and subsector size. The first (and more complex) equation we tried is:
where c is the cache size, s is the sector size, and ss is the subsector size, all sizes in bytes. This is an empirical equation that combines the factor parameters in a straight-
Table 2. Regression of the miss ratios from Table 3 for the three types of caches. c is cache size, s is sector size, and ss is subsector size.
and was found to be fairly close in predicting the unified and instruction miss ratios as measured by the $R^2$ closeness of fit (Table 2). An analytic derivation of the relationship between the miss ratio and the cache size in [15] predicts that the miss ratio should fall proportionally to $c^{-1}$, but it reports values of $c^{-0.49}$ and $c^{-0.54}$ taken from experiments. These values come from the combined weighted miss ratios of instruction and data streams using a Harvard Architecture (separate instruction and data caches).
The sector utilization illustrates why some sector caches perform reasonably well. In our cache simulations only the demanded sectors are brought into the cache, and so at least one subsector must be touched while the sector is present in the cache. This constrains the utilization to be: 1/(number of subsectors per sector) ≤ utilization ≤ 1.0. If the utilization is closer to the lower bound, then much of the cache space is not being used, and the miss ratio of the cache will be close to that of a cache reduced in size by a factor equal to the number of subsectors per sector (e.g., given four subsectors per sector, the effective cache size will be one fourth as big). If the utilization is closer to the upper bound (good spatial locality), then the miss ratio will be that of a cache using a full sector fetch, multiplied by the number of subsectors per sector.
Design Target Miss Ratios
A problem with any study of cache and memory system design is to come up with a set of "typical" miss ratios. In [23] a set of "design target miss ratios" (DTMRs) was proposed, as lying within the range of the various published measurements, and they were put forth as suitable for the role of "typical." Results in [4] suggest that they serve that role reasonably well. The original DTMRs were for a fully-associative cache with 16-byte lines; they were extended to a wider range of line sizes in [24] and to varying degrees of associativity in [8]. In each case, the extrapolation method used "ratios of miss ratios." In that technique, a set of simulations was used to measure the ratio of the miss ratio for cache configuration X to that for configuration Y (i.e., $RMR(X, Y) = MR(X)/MR(Y)$ for those simulations, where MR denotes the simulated miss ratio), and then DTMR(X) was computed from DTMR(Y) using RMR(X, Y), that is, $DTMR(X) = DTMR(Y) \times RMR(X, Y)$.
The assumption was that although the absolute value of the miss ratios for a given configuration would depend strongly on the workload, the ratio of miss ratios could be expected to be much more stable, which was borne out by the measurements. The ratios of miss ratios for the multiprogrammed workload are used here to extend the results from [24] to provide DTMRs for sector caches. These DTMRs are also used to estimate the design target traffic ratios, as shown in detail in

Several approximations were made to provide DTMRs outside the range of cache and block sizes in [24]. For cache sizes above 32K bytes (the largest cache size in [24]), we use a rule of thumb that the miss ratio drops roughly as $(\text{cache size})^{-0.5}$ as suggested in [22], which is borne out by our regression for the data miss ratio (Table 2), and is consistent with results reported in [15]. Our regression results show that the instruction cache miss ratio falls almost directly proportionally to the inverse of the cache size, but we use the more conservative inverse square root of the cache size for our DTMRs.
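The "ratios of miss ratios" technique and the inverse square root rule of thumb can be sketched in a few lines. The function names are hypothetical, and the numbers in the usage note are made-up placeholders, not values from Tables 3 or 4.

```python
def ratio_of_miss_ratios(mr_x, mr_y):
    """RMR(X, Y): simulated miss ratio of configuration X over that of Y."""
    return mr_x / mr_y


def extend_dtmr(dtmr_y, mr_x, mr_y):
    """Estimate DTMR(X) from a known DTMR(Y) and two simulated miss ratios."""
    return dtmr_y * ratio_of_miss_ratios(mr_x, mr_y)


def scale_by_cache_size(dtmr_at_base, base_bytes, cache_bytes, exponent=-0.5):
    """Extrapolate a DTMR to a larger cache, assuming miss ratio ~ size**exponent."""
    return dtmr_at_base * (cache_bytes / base_bytes) ** exponent
```

For instance, if configuration Y has a design target miss ratio of 0.04 and the simulations give miss ratios of 0.03 for X and 0.02 for Y, then `extend_dtmr(0.04, 0.03, 0.02)` yields 0.06. Extrapolating a 32K-byte DTMR of 0.04 to 128K bytes with the inverse square root rule halves it to 0.02.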
The new sector DTMRs are derived from the established DTMRs and the ratio of miss ratios using the formula: [24] .
The DTMRs for sector caches generated from the method described above can be found in Table 4 (a more complete table can be found in [17]). The data points where the subsector and sector sizes are equal have the same value as the DTMRs in [24] for caches from 4K to 32K bytes with block sizes from 16 to 128 bytes. The overall comparison of the simulated miss ratios (Table 3) with the DTMRs shows that the DTMRs are somewhat more pessimistic (i.e., higher), particularly for unified and instruction caches. This is consistent with observations in [4], where it was observed that the sorts of workloads used for trace analysis and benchmarking frequently had significantly lower miss ratios than workloads observed in practice, especially in commercial environments.
Eviction Traffic
Our eviction traffic study (from [17]) shows the fraction of evicted subsectors that are dirty, and thus must be written back. The average data cache dirtiness we found is higher than the averages reported in [23, 28], but our workload and cache sizes are much larger. Our study did confirm the general trend that larger cache sizes have a larger percentage of dirty evicted blocks.
Our results show that the average fraction of sector dirtiness falls with decreasing subsector size. In a normal (nonsectored) cache, a write to any word of a block requires that the entire block be written back. For a sector cache, only the modified subsectors need to be written back. In some cases, the use of a sector cache can significantly reduce the write traffic.
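The difference in write-back traffic can be shown with a small worked example. The helper below is a hypothetical illustration of the rule stated above, counting the bytes written back when one sector (or one non-sectored block of the same size) is evicted.

```python
def writeback_bytes(dirty_subsectors, sector_bytes, subsector_bytes, sectored=True):
    """Bytes written back on eviction of one sector or block.

    dirty_subsectors is a collection of the indices of modified subsectors.
    A non-sectored block is written back whole if any part of it is dirty.
    """
    dirty = set(dirty_subsectors)
    if not dirty:
        return 0
    if sectored:
        return len(dirty) * subsector_bytes
    return sector_bytes
```

With a 256-byte sector divided into 64-byte subsectors, a write that touches only one subsector costs 64 bytes of write-back traffic in the sector cache, compared with 256 bytes for a non-sectored 256-byte block.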
With the values for the eviction of dirty data determined here and the DTMRs developed in the previous section, we calculated various other metrics, which due to space constraints can be found in [17].
Implementation-Dependent Results
Given the DTMRs calculated in the previous section, we develop design target memory delay to aid in characterizing the performance of sector caches. After looking at the basic characteristics of sector caches, we pose and answer the following questions: under what circumstances are sector caches useful? How would a designer evaluate the tradeoffs for the various cache organizations? Using memory delay as the primary metric, we evaluate the best designs for single- and two-level cache systems given a certain transistor budget.
Design Target Delay
A detailed analysis of bus utilization appears in [17]; here we concentrate on memory access delay as the most appropriate performance metric. The real performance metric for a memory system is the effective memory access time, also called the average memory access delay. The access delay in an n-level cache hierarchy can be modeled as

with $mr_i(L_i)$ the miss ratio (a function of $L_i$, the transfer size) at cache level i with respect to the processor, and $t_i$ the miss penalty for fulfilling the miss from level i + 1 in the memory hierarchy. This timing assumes that a hit or determining that a miss has occurred in the on-chip cache takes 1 processor cycle. The miss penalty time is a function of the fetch size (subsector size) $L_i$, the latency $a_i$ (cycles until the first word is ready in memory level i + 1 for transfer), $d_i$ the data width to the next level of memory, and $c_i$ the additional time to transmit $d_i$ bytes. The miss penalty is stated by:

These simple models ignore such features as queueing delays and stalls due to evicted sector write-backs, but these factors are small for uniprocessor systems [30]. This delay model assumes a burst transfer mode that allows a single address transaction followed by a burst transfer of the number of data words requested.
Table 6 shows a sampling of the various cache designs, using design target delays for 16-cycle startup latency, with 4 cycles per word transfer rates. For a given cache size, the best choice to reduce the memory delay is always a nonsector organization. However, there are a few choices that could greatly reduce the number of tag bits to control the cache with only a slight impact on performance. For example, in Table 6, a 64 Kbyte unified cache with 256-byte sectors and 64-byte subsectors performs a bit worse than a cache using a 64-byte block (delay 2.03 vs. 1.79), but uses approximately one fourth the number of tag bits. However, compared to the number of bits to implement the entire cache, the reduction is somewhat modest (saving 4.12 Kbytes or about 6 percent), while decreasing the speed by about 13.4 percent. When the best performance for a given budget of bits is determined (see [17]), the performance improvement from choosing a sector cache organization for a single-level cache is generally on the order of 2 percent or less for reasonably sized caches. We can conclude that sector caches show very little advantage for on-chip caches. Designers were justified in abandoning sector caches for single-level caches.
Two-Level Cache Designs
As shown above, sector caches seldom yield significant improvements for single-level caches, but as we show in this section, they can be quite useful in a system with a two- (or multi-) level cache. In such a multi-level system, the first level is usually fairly small, and the second level can typically be made quite large. It is helpful in a multi-level cache to have the tags for the second level present at the first level, since this can save several cycles in processing a reference that misses to the first level, and then either hits or misses at the second level. The useful feature of a sector cache in this case is that the data arrays for a large off-chip cache can be controlled by a relatively small number of on-chip tag bits.
For example, the address tag and other overhead bits for a 512K instruction cache using 64-byte blocks require 44 Kbytes to implement. By using a sectored organization with 512-byte sectors and 64-byte subsectors, only 6.03 Kbytes are needed to manage the cache. The savings in the number of bits required to manage a second-level cache of a given size can then be used to increase the size of the first-level cache; alternately, a larger second-level cache can be used than could have been managed with the non-sector design.
To compare various choices for two-level caches, we assume that the startup latency to access main memory is 24 (processor) cycles, with a bandwidth of one word (four bytes) every four cycles. A level-one miss causes an access to the level-two cache, requiring a startup time of one cycle, with two words (64 bits) transferred per cycle. This is a 2-1-1-1 burst mode transfer from the level-two cache to the level-one cache on a level-one cache miss (two cycles to receive the first word, one word each cycle after). This setup assumes that the second-level cache is interposed between the first-level cache and the memory bus (serial organization). One other possible two-level cache organization (parallel cache) uses the memory bus for all L1 and L2 miss transactions, which is slower than the serial organization. The trade-offs between these two organizations can be found in [1].
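One plausible reading of the delay model, applied to the two-level timings above, can be sketched as follows. The formulas in the code are assumptions rather than the authors' stated equations: the miss penalty is taken to be the startup latency plus the per-transfer time multiplied by the number of bus-width transfers needed for the fetch, and the access delay is taken to be one cycle plus the sum over levels of miss ratio times miss penalty. This reading does reproduce the 2-1-1-1 burst described above (5 cycles for a 32-byte fetch over an 8-byte path). The function names and the miss ratios in the usage note are hypothetical.

```python
import math


def miss_penalty(fetch_bytes, latency, width_bytes, cycles_per_transfer):
    """Assumed penalty: startup latency plus time for all bus transfers."""
    transfers = math.ceil(fetch_bytes / width_bytes)
    return latency + cycles_per_transfer * transfers


def access_delay(levels, hit_cycles=1):
    """Average cycles per reference for a cache hierarchy.

    levels is a list of (miss_ratio, penalty) pairs, one per cache level,
    with each miss ratio measured with respect to processor references.
    """
    return hit_cycles + sum(mr * penalty for mr, penalty in levels)
```

For example, with an illustrative level-one miss ratio of 0.05 fetching 32-byte subsectors from the level-two cache, and a level-two miss ratio of 0.01 (with respect to the processor) fetching 64-byte subsectors from main memory, the penalties are `miss_penalty(32, 1, 8, 1)` = 5 cycles and `miss_penalty(64, 24, 4, 4)` = 88 cycles, giving a delay of 1 + 0.05 × 5 + 0.01 × 88 = 2.13 cycles per reference.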
Since there are literally thousands of possible combinations of first and second-level caches over the range of cache sizes and designs that we have considered in this paper, we have selected a few representative cases. In each case, we present the best organizations for a given on-chip bit budget using the timing mentioned above in combination with design target miss ratios. For each type of cache (unified, instruction, data), we examined six different cache organizations under the same timing constraints. The different organizations consist of normal (non-sector) and sector versions of an off-chip cache (with the tags on-chip) (Figure 3a), a single-level on-chip cache (Figure 3b), and a two-level cache with the level-one cache and tags for the level-two cache on-chip (Figure 3c). For a given number of bits, the best organization is given (Table 7; see [17] for more complete tables and also for instruction caches). These tables assume a 48-bit virtual address and eight-way set-associativity for both levels of cache. For a given total on-chip cache bit budget, the best choices under the various constraints (on- or off-chip cache, sector vs. normal cache) are shown. The sector cache choices for each organization will be as good as or better than the normal cache, since the normal cache organizations are a proper subset of the sector cache choices. The two-level caches will be as good as or better than the single-level off-chip cache as well as the single-level on-chip cache, since those organizations are considered proper subsets of the two-level caches.
Off-chip organizations (some with very small on-chip caches) were used by the early members of the PA-RISC 7000 series, although they were able to run the off-chip cache at the same rate as the processor, providing single cycle penalties [11]. Newer generations of processors are unlikely to be able to run off-chip caches at the same rate as the processor, due to the access time of large caches (which increases with cache size due to capacitive loading) and the delay caused by the off-chip interconnection wires, which have much higher inductive, capacitive, and resistive values than the much smaller on-chip interconnects.
From the tables, we can see that the sector cache organizations are the best choices when the number of on-chip bits is small, due to more flexibility in the organization. Table 7 demonstrates how powerful the sector organization can be. Many of the best choices for a two-level cache memory system utilize sectors for the first and/or second levels. The reduction in delay by using a sector cache organization is very significant for gross cache sizes of 20 Kbytes of space or less.
For large numbers of on-chip cache bits, sector caches are sometimes useful, but even when useful, yield only small performance improvements. Note that although chip sizes are rapidly increasing, two-level sector caches are still very useful for future designs. Optimal design is not making the biggest chip, but the best chip, trading off cost and performance. Making a smaller chip is much cheaper. Alternately, putting multiple CPUs on a large chip (thus reducing the area available for caches) may be a preferred approach. Thus the space available for on-chip cache storage will likely continue to be very limited.
Sample Organizations
Table 8 shows a number of real cache designs, and contrasts the level of performance with that possible from a sectored design (assuming the timings used earlier in this section). In most cases, better performance, sometimes quite significantly so, can be had with fewer or similar amounts of on-chip resources. For single-level caches, it is worthwhile to divide the cache size in half in order to provide tags to control a large second-level cache. Two-level caches can also be reorganized to control a larger cache with fewer tags. It should be noted that some of the processors already use sector caches. Some examples of this are the IBM 601 [13] and the UltraSPARC [29] (shown in Table 8). Sector caches can be used to enhance performance in real systems, often at a smaller overall cost, measured in delay time and on-chip space.
Conclusions
In this paper, we have provided a thorough analysis of the design trade-offs for sector caches and have determined the circumstances under which they are better than normal, non-sectored caches. Using miss ratios and other statistics from the raw trace driven simulation of our multiprogrammed workload, design targets were developed. These design targets included miss ratios and memory delay (average memory access time).
Using the number of bits in a cache and delay, we examined the best performing caches for a given bit budget. We found that sector caches are useful in boosting system performance, particularly in situations with limited numbers of bits. In these caches, some of the bits are best utilized as tags for off-chip second-level caches. Only as the second-level cache size becomes quite large can additional bits be employed for the on-chip cache to aid in reducing memory access time. Dividing a sector into subsectors aids in controlling large amounts of off-chip cache with as few on-chip address tags as possible, while reducing the traffic that using large blocks causes.
This work can be extended in several ways. One problem with sector caches is that an estimated 72 percent [7] of the subsectors are not referenced while a sector is present in the cache; our data show that the fraction of unused subsectors ranges from 6 percent to over 90 percent. Research is required to more effectively utilize the data space in sector caches, such as along the lines of [20, 18], which show a method of disassociating subsectors from sectors to reduce tag space without adversely impacting performance. More demanding workloads should be found to analyze caches and push them harder, as common workloads such as SPEC have been found lacking in utilizing larger caches [4]. Other work, in progress, is to examine the utility of sector caches in a shared memory multiprocessor.
