Maintaining a low tag array size is a major issue in many cache designs. In the decoupled sectored cache we present in this paper, the monolithic association between a cache block and a tag location is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations.
Introduction
The tag implementation cost is a major issue in many cache designs, particularly for the design of L2 cache controllers. To reduce this cost, sectors have been used in many cache designs for more than 20 years. A sector consists of several contiguous cache blocks associated with a single address tag (see footnote 1). The size of the tag array in a sectored cache is significantly lower than the size of the tag array in a non-sectored cache. Unfortunately, on many applications, a sectored cache exhibits significantly higher miss ratios than a non-sectored cache.
The aim of the decoupled sectored cache [8] is to reconcile low tag implementation cost with low miss ratio. In a traditional sectored cache, each cache line location is statically linked to one and only one address tag location. In the decoupled sectored cache, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen among several possible tag locations.
The remainder of the paper is organized as follows. In section 2, we show why maintaining a low tag implementation cost is a major issue in many cache designs, and particularly in L2 cache designs. In section 3, we introduce the decoupled sectored cache organization. In section 4, trace-driven simulation results are presented. It is shown that, at comparable tag sizes, L2 decoupled sectored caches exhibit better hit ratios than traditional sectored caches and can even reach the same level of performance as a non-sectored cache, but at a lower hardware cost. Section 5 concludes this study.
2 Tag arrays
2.1 Tag implementation cost in caches
In traditional caches, a tag word is associated with each cache line. This tag word consists of an address tag and some other status tags (figure 1); the address tag makes it possible to retrieve the effective address (virtual or physical) of the data block stored in the cache line. Limiting the size of the tag array is an important issue for some cache designs. In many microprocessor systems, the tag array has to service two accesses per cycle: a snooping transaction on the bus for maintaining cache coherency and a transaction from the processor for accessing the cache. The tag array then has to be dual-ported while the data array is single-ported. Therefore, when the cache is off-chip, it is not cost-effective to build the tag array with the same RAM chips as the data array itself. Including the tag array and the whole control logic for the cache in a single companion chip is an attractive solution that has been adopted for the design of many L2 caches. In this case, the maximum size of the tag array is limited by integration density.
Footnote 1: There is no general agreement on technical terms; for some authors, the cache block corresponds to our cache sector and consists of several contiguous cache subblocks (a cache block for us).
In order to reconcile a small or medium line size with a low tag array size, many cache designers have used sectored caches [2] (figure 2). In a sectored cache, a cache sector consists of a single address tag and several contiguous cache line locations. Each cache line in the cache sector has its own coherency and validity tags, but the unique address tag represents the memory address for all these cache lines, i.e. when two cache blocks are valid in a single cache sector, their addresses only differ by the sector offset. In a sectored cache, the size of the tag array is significantly lower than the size of the tag array in a traditional cache using the same line size.
Notation: for convenience, we shall refer to a sectored cache where each sector consists of P cache lines as a P-sectored cache.
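As a rough illustration of why sectoring shrinks the tag array, the sketch below counts address tags for a non-sectored cache and for a P-sectored cache. The function name is hypothetical, and the 256-Kbyte cache with 32-byte lines is borrowed from the configuration simulated later in the paper; only the number of address tags is counted, not their width.

```python
def address_tag_count(cache_size: int, line_size: int, sectoring_degree: int = 1) -> int:
    """Number of address tags in a P-sectored cache (P = 1 means non-sectored).

    Hypothetical helper: one address tag is shared by P contiguous cache lines.
    """
    lines = cache_size // line_size
    return lines // sectoring_degree


# 256-Kbyte cache with 32-byte lines
non_sectored = address_tag_count(256 * 1024, 32)        # one tag per line
eight_sectored = address_tag_count(256 * 1024, 32, 8)   # one tag per 8 lines
print(non_sectored, eight_sectored)  # prints "8192 1024"
```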
Miss ratio on sectored caches
On sectored caches, two kinds of misses may be distinguished:
A) Sector misses: the referenced block is missing and no block in the same memory sector is alive in the cache. Sector misses correspond to the misses which would have occurred if the line size was the sector size. On a sector miss, all blocks belonging to the replaced sector have to be invalidated.
B) Cache block misses: the referenced block is missing, but some other block(s) of the same memory sector is (are) alive in the cache.
Thus, in many applications, a sectored cache results in a higher miss ratio than a non-sectored cache using the same line size.
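The two kinds of misses can be made concrete with a toy model. The following is a minimal sketch of a direct-mapped P-sectored cache, not the simulator used in the paper; the class name, its fields, and the use of block addresses (byte offset already stripped) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SectoredCache:
    """Toy direct-mapped sectored cache (hypothetical model)."""
    num_sectors: int                            # cache sectors, one address tag each
    p: int                                      # sectoring degree P
    tags: dict = field(default_factory=dict)    # sector index -> address tag
    valid: dict = field(default_factory=dict)   # sector index -> valid block offsets

    def access(self, block_addr: int) -> str:
        offset = block_addr % self.p            # block position inside the sector
        mem_sector = block_addr // self.p
        index = mem_sector % self.num_sectors
        tag = mem_sector // self.num_sectors
        if self.tags.get(index) != tag:
            # Sector miss: the tag is replaced and every block of the
            # previous sector is invalidated.
            self.tags[index] = tag
            self.valid[index] = {offset}
            return "sector miss"
        if offset not in self.valid[index]:
            # Cache block miss: the sector is present, the block is not.
            self.valid[index].add(offset)
            return "block miss"
        return "hit"


cache = SectoredCache(num_sectors=4, p=8)
trace = [0, 1, 0, 32, 0]
results = [cache.access(a) for a in trace]
```

In this trace, block 32 maps to the same cache sector as block 0, so it evicts the whole sector and the final reference to block 0 misses again.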
3 Decoupled sectored caches
Principle
In a sectored cache, a cache line location is physically linked to one and only one address tag (Figure 2). All the valid data blocks in a cache sector belong to the same memory sector. In a decoupled sectored cache, this static association is broken.
Definition 3.1 In a decoupled sectored cache, the location for the address tag associated with a cache line location is partially dynamically determined at fetch time.
Figure 3 illustrates an example of such a decoupled sectored cache. The data array is divided into cache sectors consisting of eight contiguous cache lines; two address tag locations are associated with each cache sector. A valid data block may belong to either of the two memory sectors represented by these address tags. For each cache block, a single selection tag bit S makes it possible to retrieve the associated address tag. This is illustrated for a cache sector.
Definition 3.2 We shall call a (N,P)-decoupled sectored cache a decoupled sectored cache for which:
1. There exists a number N such that, for any cache line LI, the address tag associated with a block stored in LI is dynamically chosen among N possible address tag locations.
2. Each address tag can correspond to up to P cache lines, i.e. the memory sector consists of P contiguous data blocks.
We shall refer to N as the decoupling degree and to P as the sectoring degree. From now on, we shall restrict our study to the (N,P)-decoupled sectored caches.
Selection tags
In a (N,P)-decoupled sectored cache, a $\log_2 N$-bit tag must be associated with each cache line in order to retrieve its associated address tag. From now on, this tag will be referred to as a selection tag.
Decoupled sectored cache and associativity
In a sectored cache, the associativity degrees of the tag array and the data array are equal. In a (N,P)-decoupled sectored cache, these two arrays are more independent; the usual associative organizations (direct-mapped, set-associative, fully associative) may be considered independently for these two arrays. These cases are detailed in [8].
We illustrate here the tag checking in the general case where the address tag array and the data array are both set-associative.
To simplify figures and notations, the offset in the cache blocks will not be represented in the remainder of the paper. Let us divide the address C of a memory block into four bit substrings (C3,C2,C1,C0), where:
1. C0 is a $\log_2 P$-bit string.
2. C1 is a $\log_2 \frac{CS}{P \times CL \times Assocdata}$-bit string, CS being the cache size, CL the line size, and Assocdata the associativity degree of the data array. (C1,C0) is the number of the set where data at address C may be stored.
3. (C2,C1) is the number of the set in the address tag array where the address C may be represented.
4. C3 is the string of the remaining most significant bits in address C.
The memory block at address C can be stored in any cache line in set (C1,C0) of the data array, and its address can be stored in any address tag in set (C2,C1) of the address tag array.
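The decomposition of a block address can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function and parameter names are hypothetical, all sizes are powers of two, CS and CL are in bytes, the width of C1 is taken as log2(CS / (P x CL x Assocdata)) so that (C1,C0) indexes the data array sets, and the width of C2 is taken as log2(N / AssocAddress) as remarked later in this section.

```python
def log2_exact(x: int) -> int:
    """Base-2 logarithm of a power of two."""
    assert x > 0 and x & (x - 1) == 0, "expected a power of two"
    return x.bit_length() - 1


def split_block_address(c, cs, cl, p, n, assoc_data, assoc_addr):
    """Split block address C (block offset already removed) into (C3, C2, C1, C0)."""
    c0_bits = log2_exact(p)
    c1_bits = log2_exact(cs // (p * cl * assoc_data))
    c2_bits = log2_exact(n // assoc_addr)
    c0 = c & ((1 << c0_bits) - 1)
    c1 = (c >> c0_bits) & ((1 << c1_bits) - 1)
    c2 = (c >> (c0_bits + c1_bits)) & ((1 << c2_bits) - 1)
    c3 = c >> (c0_bits + c1_bits + c2_bits)
    return c3, c2, c1, c0


# (4,64)-decoupled sectored cache, 256 Kbytes, 32-byte lines,
# direct-mapped data array and direct-mapped address tag array:
# C0 is 6 bits wide, C1 is 7 bits wide, C2 is 2 bits wide.
block = (5 << 15) | (3 << 13) | (100 << 6) | 17
fields = split_block_address(block, cs=256 * 1024, cl=32, p=64, n=4,
                             assoc_data=1, assoc_addr=1)
print(fields)  # prints "(5, 3, 100, 17)"
```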
Let us suppose that the memory block at address C is valid in the cache. Then:
1. The memory sector at address (C3,C2,C1,0) is represented by some address tag in the address tag array (only C3 has to be stored). Let us call X the position where it is represented in the set (C2,C1) of the address tag array.
2. The block C0 of the memory sector (C3,C2,C1,0) is valid in the cache array: the selection tag S associated with the cache block in the cache array must make it possible to retrieve the precise address tag associated with the cache block, i.e. it must indicate that this address tag is stored at position X in the set (C2,C1).
To provide this information, the selection tag consists of two parcels: a position parcel where X is stored and a set parcel where C2 is stored (C1 is implicitly known). Figure 4 illustrates the tag checking on a (N,P)-decoupled sectored cache where both the address tag array and the data array are two-way set-associative.
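The tag check described above can be sketched in software. This is only a functional model under assumptions: the dictionaries standing for the two arrays, the field names `valid`, `set_parcel` and `position`, and the function name are all hypothetical, and a real controller would perform these comparisons in parallel rather than in a loop.

```python
def lookup(c3, c2, c1, c0, data_sets, tag_sets):
    """Return True when block (C3,C2,C1,C0) is valid in the cache.

    data_sets maps a data set number (C1,C0) to its cache lines;
    tag_sets maps a tag set number (C2,C1) to the stored C3 values.
    """
    for line in data_sets.get((c1, c0), []):
        if not line["valid"]:
            continue
        # Set parcel of the selection tag: which tag set holds the address tag.
        if line["set_parcel"] != c2:
            continue
        # Position parcel: position X of the address tag inside that set.
        if tag_sets[(c2, c1)][line["position"]] == c3:
            return True
    return False


# Toy contents: two-way set-associative address tag array and data array.
tag_sets = {(0, 5): [0x2A, None], (1, 5): [None, 0x07]}
data_sets = {
    (5, 3): [
        {"valid": True, "set_parcel": 0, "position": 0},
        {"valid": True, "set_parcel": 1, "position": 1},
    ]
}
hit_a = lookup(0x2A, 0, 5, 3, data_sets, tag_sets)   # True
hit_b = lookup(0x07, 1, 5, 3, data_sets, tag_sets)   # True
miss = lookup(0x2A, 1, 5, 3, data_sets, tag_sets)    # False: tag mismatch
```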
Size of the address tag array
A (N,P)-decoupled sectored cache of size CS and line size CL consists of $\frac{CS}{CL}$ cache lines. Every cache line in a fixed set of the data array belongs to a fixed collection of N sectors. This collection of N sectors represents the tags for P consecutive sets in the data array. Thus N address tags represent $P \times Assocdata$ cache lines.
Therefore the address tag array of a (N,P)-decoupled sectored cache of size CS and line size CL consists of $\frac{N \times CS}{P \times CL \times Assocdata}$ address tags.
Note that the C2 address substring is $\log_2 \frac{N}{AssocAddress}$ bits wide, where AssocAddress is the associativity degree of the address tag array. The tag array thus maps a total space of size $\frac{N \times CS}{Assocdata}$ with a granularity of $P \times CL$ bytes.
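As a worked check of these expressions, the sketch below evaluates the number of address tags, the mapped space and the granularity for a (4,64)-decoupled sectored cache of 256 Kbytes with 32-byte lines and a direct-mapped data array. The function name is hypothetical; the formulas are the ones stated above, with CS and CL in bytes.

```python
def tag_array_geometry(cs, cl, n, p, assoc_data):
    """Return (number of address tags, mapped space in bytes, granularity in bytes)."""
    tags = n * cs // (p * cl * assoc_data)
    granularity = p * cl
    mapped_space = tags * granularity    # equals n * cs // assoc_data
    return tags, mapped_space, granularity


tags, mapped_space, granularity = tag_array_geometry(
    cs=256 * 1024, cl=32, n=4, p=64, assoc_data=1)
print(tags, mapped_space, granularity)  # prints "512 1048576 2048"
```

The result, 1 Mbyte mapped with a 2-Kbyte granularity, is the figure quoted for this configuration in the discussion of the simulation results.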
Tag implementation cost in decoupled sectored caches
In order to roughly estimate the implementation cost of tags in a decoupled sectored cache, we chose to measure the number of storage bits needed in the tags. This measurement gives a first-order estimate of the implementation cost of the tag array. A more complete analysis of this implementation may be done using more sophisticated models, such as the MQF model [6]. Only the tags needed for retrieving the address of the data block stored in a cache line will be considered here, i.e. address tags for non-sectored and sectored caches, and address tags and selection tags for decoupled sectored caches.
The width $W_{AT}$ (in bits) of the address tag depends on the width b of a memory address, the size of the cache $CS = 2^s$, the associativity degree of the data array $Assocdata = 2^D$, the associativity degree of the address tag array $Assocaddress = 2^T$, and the decoupling degree $N = 2^n$. This width $W_{AT}$ is given by: $W_{AT}$
Assuming that the width of a memory address is 36 bits and that 32-byte lines are used, table 1 (resp. table 2) compares the sizes of some address tag arrays for sectored and decoupled sectored caches with a direct-mapped (resp. 2-way set-associative) data array against the size of the tag array of a non-sectored cache. In these tables, the address tag arrays of the decoupled sectored caches are assumed to be direct-mapped.
The width of a selection tag only depends on the decoupling degree. Hence, as a selection tag is associated with each cache line, increasing the sectoring degree does not reduce the contribution of the selection tags to the size of the tag array. On a 256-Kbyte (4,P)-decoupled sectored cache, the width of the selection tag represents one ninth of the width of an address tag on a 256-Kbyte non-sectored cache: 2 bits per cache line instead of 18 bits per cache line. Thus, increasing the sectoring degree over 64 will not significantly decrease the size of the tag array.
Notice that the decoupling degree must also be kept relatively small, since the width of the selection tag attached to every cache line grows with $\log_2 N$.
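The one-ninth figure can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical variable names; it assumes the 36-bit addresses and 32-byte lines stated above and, for the non-sectored reference, a direct-mapped cache whose address tag holds the address bits above the cache size.

```python
def log2_exact(x: int) -> int:
    """Base-2 logarithm of a power of two."""
    assert x > 0 and x & (x - 1) == 0, "expected a power of two"
    return x.bit_length() - 1


ADDRESS_BITS = 36
CACHE_SIZE = 256 * 1024      # bytes
LINE_SIZE = 32               # bytes
DECOUPLING_DEGREE = 4        # N

lines = CACHE_SIZE // LINE_SIZE
selection_bits_per_line = log2_exact(DECOUPLING_DEGREE)             # 2
address_tag_bits_per_line = ADDRESS_BITS - log2_exact(CACHE_SIZE)   # 18

# Storage that does not shrink when the sectoring degree P grows:
selection_storage_bits = lines * selection_bits_per_line
print(selection_bits_per_line, address_tag_bits_per_line, selection_storage_bits)
# prints "2 18 16384"
```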
Invalidating sectors
When the address of a memory sector is missing in the address tag array, another memory sector address has to be invalidated in the tag array. In this case, all the valid cache blocks from this rejected memory sector have to be invalidated in the cache. Invalidating a sector in a (N,P)-decoupled sectored cache is slightly more complex than in a sectored cache:
On the invalidation of the sector S, the selection tags of the $P \times Assocdata$ cache locations where cache blocks belonging to S might be stored have to be checked, and the corresponding blocks possibly invalidated. In the worst case, P cache blocks have to be invalidated and, possibly, written back to memory (or to an extra level of cache). Such a worst case might result in a burst of write-back traffic to memory. The complexity of the control logic for this invalidation and the sector miss penalty depend on the precise cache controller implementation, and particularly on the number of selection and validity tags which can be checked in parallel.
Nevertheless, the simulations presented in Section 4 will show that sector invalidations are quite rare and that the average number of invalidated blocks per sector invalidation is low.
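A sequential sketch of this invalidation is given below. It is an assumption-laden model rather than the paper's controller: the dictionary standing for the data array, the field names `valid`, `dirty`, `set_parcel` and `position`, and the function name are hypothetical, and hardware would check several selection tags in parallel.

```python
def invalidate_sector(data_sets, c1, p, set_parcel, position):
    """Invalidate every cache block whose selection tag points to the rejected
    address tag; return (invalidated count, blocks needing write-back)."""
    invalidated = 0
    write_backs = []
    for c0 in range(p):                              # P consecutive data sets
        for line in data_sets.get((c1, c0), []):     # Assocdata lines per set
            if (line["valid"]
                    and line["set_parcel"] == set_parcel
                    and line["position"] == position):
                line["valid"] = False
                invalidated += 1
                if line["dirty"]:
                    write_backs.append((c1, c0))
    return invalidated, write_backs


# Toy direct-mapped data array holding blocks of two different sectors.
data_sets = {
    (5, 0): [{"valid": True, "dirty": True, "set_parcel": 0, "position": 0}],
    (5, 1): [{"valid": True, "dirty": False, "set_parcel": 0, "position": 1}],
    (5, 2): [{"valid": True, "dirty": False, "set_parcel": 0, "position": 0}],
}
count, write_backs = invalidate_sector(data_sets, c1=5, p=4, set_parcel=0, position=0)
```

Here two blocks are invalidated and one of them, being dirty, has to be written back; the block belonging to the other sector is left untouched.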
Performance evaluation
Trace-driven simulations have been run in order to compare the behavior of decoupled sectored caches with the behavior of conventional sectored caches using comparable address tag array sizes.
Simulation parameters
A uniprocessor running a single application was simulated. Results for L2 caches are reported here; results of simulations of L1 caches may be found in [7]. The Spa package developed by Gordon Irlam [5] was used to generate address traces for programs executed on a SUN SparcStation2. No modification of the binary to be analyzed was required; the user code of a single application is completely traced. Ten benchmarks were used. Eight benchmarks were chosen from the SPEC 92 collection, including 3 integer benchmarks (espresso: esp, compress: com, and gcc) and 5 floating-point benchmarks (doduc: dod, tomcatv: tom, alvinn: alv, swm256: swm256, fpppp: fpp). We also used as benchmarks two of our own C applications: cac, a cache simulator, and net, the simulator of a particular interconnection network.
Addresses were first sent through the simulator of a first-level write-back cache. Only references missing on this first-level cache were sent to the simulator of the second-level cache. Then this unique sequence of references was simulated on multiple configurations of L2 caches. The simulated L1 cache was unified and 2-way set-associative; its size was 16 Kbytes, and its line size was 32 bytes. The first billion references (when available) were simulated for all the benchmarks.
Only 256-Kbyte L2 caches are illustrated here; more results are given in [8]. A 32-byte line size was used. Global miss ratios (i.e. $\frac{\text{Number of misses}}{\text{Number of memory references}}$) on non-sectored caches for all the benchmarks and for associativity degrees 1 and 2 are given in table 3 (footnote 2).
Simulation results are presented for 8-sectored caches and for decoupled sectored caches with comparable tag array sizes (see tables 1 and 2), i.e. (4,64)-decoupled sectored caches when the data array is direct-mapped and (4,32)-decoupled sectored caches when the data array is 2-way set-associative.
Performance metrics
Cache misses
Decoupled sectored caches and sectored caches will exhibit higher miss ratios than non-sectored caches. As one can expect these miss ratios to be in the same range, we use the relative miss ratio as a metric:
Definition 4.1 (Relative miss ratio) Let Ca be a non-sectored cache with size S, line size CL, and associativity degree A, and let Ca1 be a sectored (or decoupled sectored) cache with size S, line size CL, and associativity degree A. We call relative miss ratio of cache Ca1 the ratio $\frac{\text{Number of misses on } Ca_1}{\text{Number of misses on } Ca}$.
Sector misses
On a decoupled sectored cache or a sectored cache, some extra penalty is paid on a miss which induces a sector miss. In order to show that such invalidations are not frequent, the ratio $\frac{\text{sector misses}}{\text{cache misses}}$ will be illustrated.
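The two metrics above reduce to simple ratios of event counts. The sketch below uses hypothetical function names and made-up counts chosen only to show the arithmetic; they are not results from the paper.

```python
def relative_miss_ratio(misses_on_ca1: int, misses_on_ca: int) -> float:
    """Misses on the (decoupled) sectored cache Ca1 over misses on the
    non-sectored cache Ca of the same size, line size and associativity."""
    return misses_on_ca1 / misses_on_ca


def sector_miss_fraction(sector_misses: int, cache_misses: int) -> float:
    """Fraction of the cache misses that are also sector misses."""
    return sector_misses / cache_misses


# Made-up counts for illustration only.
rel = relative_miss_ratio(1500, 1000)      # 1.5: 50% more misses than Ca
frac = sector_miss_fraction(250, 1000)     # 0.25: one miss in four evicts a sector
```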
On a sector invalidation, up to P cache blocks have to be invalidated in the cache. The average number of invalidated blocks per sector miss will also be illustrated.
Tables 4, 5 and 6 illustrate respectively the relative miss ratios, the ratio $\frac{\text{sector misses}}{\text{cache misses}}$ (footnote 3), and the average number of invalidated blocks per sector invalidation for 256-Kbyte direct-mapped L2 caches on our benchmark set. For decoupled sectored caches, the decoupling and sectoring degrees and the associativity degree of the address tag array are indicated.
Direct-mapped data array
Sectored caches
On most benchmarks, the miss ratio on sectored caches is significantly higher than on non-sectored caches.
Direct-mapped tag array
When both the data array and the tag array are direct-mapped, there is very little benefit from using a decoupled sectored cache structure.
This quite disappointing behavior can be explained by two phenomena:
1. The ratio of sector misses is quite significant for most applications (table 5). For instance, in the illustrated experiments, the tag array in the (4,64)-decoupled sectored cache maps 1 Mbyte with a 2-Kbyte granularity. When such a large granularity is used, it is well known that the behavior of direct-mapped caches is dramatically affected by conflict misses.
2. Conflict misses on sectors lead to invalidations of useful blocks. For instance, on fpp, the average number of invalidated blocks per sector invalidation is very high.
Set-associative tag array The maximum bene t of using decoupled sectored caches is obtained when the address tag array is set-associative. We can observe that, when the data array is direct-mapped, at the exception of our two applications (cac and net), the miss ratio on a decoupled sectored cache is very close to the miss ratio on a non-sectored cache. Even on these applications, the performance is higher than on non-sectored caches. Due to the associativity of the tag array, the ratio of sector misses is very low except for our two personal applications.
Moreover, the number of blocks invalidated per sector miss is very low (table 6). As an LRU policy is used on the tag array, the rejected sector tends to be an empty (or nearly empty) one.
Tables 7, 8 and 9 illustrate respectively the relative miss ratios, the ratio $\frac{\text{sector misses}}{\text{cache misses}}$, and the average number of invalidated blocks per sector invalidation for 256-Kbyte 2-way set-associative L2 caches on our benchmark set.
Set-associative data array
Direct-mapped tag array
Using a direct-mapped tag array and a set-associative data array in a decoupled sectored cache leads to very low performance. As with a direct-mapped data array, this behavior is explained by a large fraction of sector misses (table 8).
Set-associative tag array
When both the data array and the address tag array are set-associative, the hit ratios on a decoupled sectored cache are also better than on a sectored cache; nevertheless, the relative hit ratio improvement is less impressive than when the data array is direct-mapped.
Conclusion
The size of the tag array is an important issue, e.g. for the design of controllers for second-level caches when the tags have to fit in a single chip.
Sectored caches have been used for many years to reconcile a small tag array with a reasonable cache line size, and hence a reasonably small grain for data transfers. Unfortunately, in many applications, sectored caches lead to significantly higher miss ratios and memory traffic than non-sectored caches.
In sectored caches, a cache line is statically linked to a single address tag location associated with a sector. In the decoupled sectored cache we have proposed, this static link is replaced by a dynamic choice at execution between several address tag locations; thus several memory sectors are sharing a set of cache lines.
The associativity degree of the address tag array in a decoupled sectored cache may be different from the associativity degree of the data array (see section 3); for instance, in order to enable a fast read hit time [3], the data array may be direct-mapped while the address tag array is set-associative.
Trace-driven simulations for the uniprocessor case presented in section 4 have shown that, at comparable tag array sizes, a decoupled sectored L2 cache with a set-associative address tag array achieves a significantly better hit ratio than a sectored L2 cache, particularly when the data array is direct-mapped. Decoupled sectored caches even achieve miss ratios close to the miss ratio of a non-sectored cache while the tag array is four or five times smaller.
A sector invalidation on a decoupled sectored cache requires more hardware logic than a sector invalidation on a sectored cache, but simulations have shown that such sector invalidations would be less frequent on decoupled sectored caches.
[Table 9: Block invalidations per sector miss on 256Kbyte 2-way set-associative caches]
