Most newly announced
Information stored in caches consists of data and tags. Address tags allow the retrieval of the addresses of blocks, while validity, dirty and coherency tags allow data consistency maintenance in the distinct levels of a memory hierarchy system.
Most recent high-performance microprocessors support 64-bit virtual addresses, while the width of physical addresses is also growing. As a result, the size of the address tags in traditional caches is increasing. This is particularly dramatic when small data block sizes are used. The tag implementation cost then becomes an important issue for on-chip caches.

*This work was partially supported by CNRS (inter-PRC project ILIAD).

ISCA '96, PA, USA. © 1996 ACM.
Performance of superscalar microprocessors depends on the accuracy of dynamic branch prediction. Large Branch Target Buffers must be used in processors in order to allow very accurate prediction.
For a Branch Target Buffer, the implementation costs for both tags (i.e., branch addresses) and data (i.e., target addresses) increase linearly with the address width. The silicon area occupied by a Branch Target Buffer can become quite significant.
Decreasing this area is also becoming an issue for microprocessor designers. In current microprocessors, the page number associated with a cache block is represented two or three times: first as a part of the address tag associated with the cache block, second as a page address in a TLB entry for virtual-to-physical address translation, and often a third time in the branch address or the target address. Furthermore, these data are read at the same time and compared during the tag check.
The contribution of this paper is to show that this replication may be removed by applying the simple principle:
Do not use the page number, but a pointer to it
The remainder of the paper is organized as follows. In Section 2 we illustrate how dramatic the hardware cost of L1 cache address tags and of Branch Target Buffers can be. In Section 3, we propose a solution for limiting the tag size for on-chip caches, the indirect-tagged cache.
In the indirect-tagged cache, the page number in an address tag is replaced by a pointer to a specific cache where page numbers are stored. We term this a page number cache or PN-cache. The PN-cache may be the TLB for virtually tagged caches or a physical PN-cache for physically tagged caches. The tag check using an indirect-tagged physically tagged cache is simpler than with conventional designs.
The size of the tag array in an indirect-tagged cache does not depend on the address width.
The cache miss ratios of indirect-tagged caches are shown to be very close to the cache miss ratios of conventional caches. The execution overhead associated with the use of an indirect-tagged cache is also shown to be small. In Section 4, we present a similar solution for the Branch Target Buffer.

Let us consider a 4-way set-associative 512-entry Branch Target Buffer (e.g., the Intel PentiumPro) and an N-bit address width, and let us assume a 4-byte constant instruction length and a 4 Kbyte page size. Each entry in the Branch Target Buffer consists of N-9 tag bits (the 9 least significant bits in the branch address are implicit) plus N-2 data bits (i.e., the target address minus the two implicit least significant bits).

In modern superscalar processors, the page number of a branch or a target is also represented.
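The linear growth of the entry width with the address width can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its default parameters are assumptions taken from the example above (512 entries, 4 ways, 4-byte instructions), not part of the original design.

```python
def conventional_btb_entry_bits(address_bits, entries=512, ways=4, instr_bytes=4):
    """Tag bits plus target bits stored in one entry of a conventional BTB."""
    def log2(x):
        return x.bit_length() - 1  # exact for powers of two

    alignment_bits = log2(instr_bytes)       # 2 implicit low-order bits
    index_bits = log2(entries // ways)       # 7 bits select one of 128 sets
    tag_bits = address_bits - (index_bits + alignment_bits)   # N - 9
    target_bits = address_bits - alignment_bits               # N - 2
    return tag_bits + target_bits


for n in (32, 40, 64):
    print(n, conventional_btb_entry_bits(n))
```

Each additional address bit costs two storage bits per entry, one in the tag and one in the target.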
In the remainder of the paper, we will show how to remove this redundancy by applying two simple principles:
1. Store page numbers only once
2. Do not use the page number, but a pointer to it

3 Indirect-tagged caches
In order to retrieve the memory address of a cache block or a page in the processor, the page number must be represented in the processor at least once. We propose that it be stored in a single table. This table will be called the page number cache or PN-cache. For on-chip caches, a pointer to this PN-cache may replace the normal representation of the page number in the address tag. This pointer will be called the indirect tag.
Caches may be virtually or physically indexed depending on whether the address used for indexing the cache is the virtual address or the physical address of the referenced data in memory.
Caches may be virtually or physically tagged depending on whether the stored tag is the virtual or the physical address of the data block in memory. In practice, physically indexed virtually tagged caches are rarely used. Three cases are studied.
For convenience, the following abbreviations will be used:

VV: virtually indexed, virtually tagged cache
VP: virtually indexed, physically tagged cache
PP: physically indexed, physically tagged cache

The mechanisms proposed in this section are intended to replace only the page number part of the tag in a cache by a pointer.
In order to simplify figures, we shall assume in the remainder of the section that (number of sets) * (block size) >= (page size). Then the address tag is included in the page number (VV caches) or is the page number (VP or PP caches).
VV indirect-tagged caches
When using a VV cache, the TLB may be used as the PN-cache.
Figure 2 illustrates an indirect-tagged direct-mapped VV cache. The indirect tag associated with the data block is a pointer to the TLB entry. On a cache access, the number of the matching entry of the TLB is checked against the tag of the selected data block(s).
Then, for each data block represented in the cache, the associated page number must be represented in the TLB.
As we assume that the size of an associativity degree in the cache is greater than or equal to the page size, the tag associated with a block belonging to page P must allow the retrieval of the index of the TLB entry where page P is represented.
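The lookup described above can be sketched as a small software model. This is a minimal illustration, not the hardware design: the class names, the fully-associative TLB, the 32-byte block size and the 128-set direct-mapped organization (one page per way) are all assumptions chosen for the example, and TLB replacement and page invalidation are left out.

```python
PAGE_BITS = 12    # 4 Kbyte pages (assumed)
BLOCK_BITS = 5    # 32-byte blocks (assumed)
NUM_SETS = 128    # 128 sets * 32 bytes = 4 Kbytes, i.e. one page


class TLB:
    """Fully-associative TLB doubling as the PN-cache (hypothetical model)."""

    def __init__(self, entries=64):
        self.vpn = [None] * entries

    def install(self, entry, vpn):
        self.vpn[entry] = vpn

    def lookup(self, vpn):
        """Return the number of the matching entry, or None on a TLB miss."""
        return self.vpn.index(vpn) if vpn in self.vpn else None

    def pointer_bits(self):
        """Width of an indirect tag pointing into this TLB."""
        return (len(self.vpn) - 1).bit_length()


class IndirectTaggedVVCache:
    """Direct-mapped cache whose tags are TLB entry numbers."""

    def __init__(self, tlb):
        self.tlb = tlb
        self.tags = [None] * NUM_SETS

    def access(self, vaddr):
        index = (vaddr >> BLOCK_BITS) % NUM_SETS
        entry = self.tlb.lookup(vaddr >> PAGE_BITS)
        if entry is None:
            return "tlb_miss"
        if self.tags[index] == entry:   # compare short pointers, not page numbers
            return "hit"
        self.tags[index] = entry        # fill the block on a miss
        return "miss"
```

In this model the stored tag is only as wide as a TLB entry number, whatever the width of the virtual address.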
As an example, when considering a 64-entry TLB, a 6-bit selection tag is sufficient, instead of the whole virtual address tag needed in a conventional cache design.

[...] in the TLB may match the same physical page. The PN-cache is not systematically read: effective physical addresses would only be needed for cache accesses resulting in an external transaction (e.g., cache miss, access to shared data).
PP indirect-tagged caches
In order to avoid the memory consistency management associated with virtually indexed caches, PP caches are used in many designs (e.g., TI SuperSparc).
The major drawback of PP caches is that the physical address, or at least some of its low order bits, is needed for indexing the cache, thus requiring that the address be translated before indexing the cache. In order to allow address translation in parallel with a cache lookup in PP caches, tying cache size and associativity to page size has been extensively used in real machines. The design proposed for indirect-tagged VP caches may be used for this class of PP caches.
Figure 4 illustrates a "true" indirect-tagged PP cache for which some of the low order bits from the physical page number are concatenated to the page offset for indexing the cache.
3.4
Tag check
VV caches
On a conventional VV cache, the tag check result is known after the following delay:
On an indirect-tagged VV cache, this delay is:
This delay is likely to be slightly longer than the delay on conventional caches.
VP caches
On a conventional VP cache, the tag check result is known after the following delay:
On VP indirect-tagged caches, the tag check result is known after a slightly shorter delay:
The tag check is simpler on an indirect-tagged VP cache than on conventional VP cache designs. On indirect-tagged VP caches, the tag check consists in comparing PN-cache pointers rather than page numbers. For instance, when a 64-entry PN-cache is used, 6-bit chains have to be compared instead of the complete page numbers compared in conventional cache designs. The cache hit time is then shorter or equal on an indirect-tagged VP cache than on a conventional VP cache.
Page invalidations
The indirect address tag of a block in the cache must point to the page number stored in the PN-cache. Then, when invalidating a page entry in the PN-cache of an indirect-tagged cache, all data blocks belonging to this page have to be invalidated in the cache. Complex hardware mechanisms would be required in order to invalidate a full page in a very short delay (a few cycles). In order to limit the hardware complexity of the invalidation mechanism, we propose to use the following replacement policy:

The number of valid cache blocks belonging to each page is stored with the address tag in the PN-cache. Among the possible pages, "empty" pages, i.e. pages with null counters, are the first choice for replacement.

For each page entry, a few storage bits for the counter must be stored in the PN-cache.
Since this counter is only accessed on a cache miss, the updating of the counter is not on the critical path; accessing a single counter on each cycle is then sufficient.
Simulations presented in 3.10 show that page invalidations are quite rare and that, when using the proposed mechanism, invalidation of pages with valid blocks is very uncommon.
This allows the implementation of an invalidation mechanism with a limited hardware complexity:
• "Empty" pages are invalidated in a few cycles: a single PN-cache entry has to be invalidated.
• The invalidation of pages with valid blocks is done sequentially: all blocks from the page are searched and invalidated sequentially.
For VP indirect-tagged caches, the corresponding TLB entries are also invalidated.
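The counter-based replacement policy can be sketched as follows for one set of a set-associative PN-cache. This is a hedged illustration: the class and method names are hypothetical, and the fallback used when no empty page exists (evicting the page with the fewest valid blocks) is an assumption, since the text only states that empty pages are the first choice.

```python
class PNCacheSet:
    """One set of a PN-cache with a valid-block counter per page entry."""

    def __init__(self, ways=4):
        self.pages = [None] * ways       # page number held in each way
        self.valid_blocks = [0] * ways   # valid cache blocks belonging to that page

    def block_filled(self, way):
        """A cache miss brought in a block of the page held in `way`."""
        self.valid_blocks[way] += 1

    def block_evicted(self, way):
        """A block of the page held in `way` left the cache."""
        self.valid_blocks[way] -= 1

    def choose_victim(self):
        # "Empty" pages (null counter) are the first choice for replacement.
        for way, count in enumerate(self.valid_blocks):
            if count == 0:
                return way
        # Assumed fallback: the page with the fewest valid blocks.
        return min(range(len(self.pages)), key=lambda w: self.valid_blocks[w])

    def replace(self, new_page):
        """Install new_page; return (way, blocks to invalidate sequentially)."""
        way = self.choose_victim()
        to_invalidate = self.valid_blocks[way]
        self.pages[way] = new_page
        self.valid_blocks[way] = 0
        return way, to_invalidate
```

When the returned block count is zero, the replacement only touches the PN-cache entry; otherwise that many cache blocks have to be searched and invalidated one by one.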
3.6
Tag implementation cost
VV caches
The size of the tag array in an indirect-tagged VV cache does not depend on the address width.
[...] 4-way set-associative TLB, and a cache size equal to the page size, only a 7-bit selection tag has to be associated with each cache block: this is approximately one sixth of the current size in conventional caches for a 48-bit virtual address + 8-bit Process ID.
VP caches
Storage bits are saved both in the cache and in the TLB. But a PN-cache is added.
Let us consider the following example:
1. a split I/D cache, each a 16 Kbyte 4-way set-associative cache
2. a 64-entry ITLB and a 64-entry DTLB
3. a 256-entry 4-way set-associative PN-cache
4. a 4 Kbyte page size
5. a 40-bit physical address

When considering a 32-byte block size, in a conventional cache, 2 x (512 x 28 + 64 x 28) = 32256 storage bits are used for storing physical addresses in the TLB and the cache. Using an indirect-tagged VP cache, a total of 2 x (512 x 8 + 64 x 8) + (22 + 9) x 256 = 17152 storage bits would be used in the cache, in the TLB and in the PN-cache (including the counters of valid blocks in the PN-cache).
When the block size is only 16 bytes, the volume needed for storing physical addresses grows more slowly on indirect-tagged VP caches (25600 storage bits) than on conventional caches (60928 storage bits).
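The arithmetic of this example can be reproduced with a short script. The function name and parameter names are illustrative assumptions. The counter width is also an assumption: it is sized here to count every block of one page in both the I-cache and the D-cache, which matches the 9 bits appearing in the formula above for 32-byte blocks.

```python
def tag_storage_bits(block_bytes, cache_bytes=16 * 1024, tlb_entries=64,
                     pn_entries=256, pn_ways=4, page_bytes=4096, phys_bits=40):
    """Return (conventional, indirect-tagged) storage bits for the example."""
    def log2(x):
        return x.bit_length() - 1  # exact for powers of two

    blocks = cache_bytes // block_bytes                   # blocks per cache
    page_number_bits = phys_bits - log2(page_bytes)       # 28
    pointer_bits = log2(pn_entries)                       # 8
    pn_tag_bits = page_number_bits - log2(pn_entries // pn_ways)   # 22
    # Assumed counter sizing: 0 .. all blocks of a page in both caches.
    counter_bits = (2 * (page_bytes // block_bytes)).bit_length()

    # Split I/D caches and split I/D TLBs, hence the factor 2.
    conventional = 2 * (blocks + tlb_entries) * page_number_bits
    indirect = (2 * (blocks + tlb_entries) * pointer_bits
                + (pn_tag_bits + counter_bits) * pn_entries)
    return conventional, indirect


print(tag_storage_bits(32))   # (32256, 17152)
print(tag_storage_bits(16))   # (60928, 25600)
```

Halving the block size doubles the number of tags, but each extra tag costs 8 bits instead of 28, which is why the indirect-tagged total grows more slowly.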
On-chip memory hierarchy
Some processors implement a complex memory hierarchy on the same chip. For instance, the DEC 21164 supports on the same chip instruction and data L1 caches and an L2 cache. Indirect-tagged caches may be used for all three caches and only a single PN-cache has to be used.
PN-cache accesses
When indirect-tagged VP caches are used in the processor, the page number of a data block is represented only once, in a centralized PN-cache. This PN-cache serves as an indirect address tag array for several usages (TLBs, L1 caches, L2 caches). It is important to note that this PN-cache does not have to be explicitly accessed for cache hits or TLB hits, but only during less frequent cache misses.
Thus the PN-cache does not need to support multiple parallel accesses from the processor.
However, many processors implement an additional access port to the cache tag array for bus snooping. In order to implement the same functionality, an access port would have to be implemented on the PN-cache.
3.9
Off-chip PN-cache implementation
We have noted that the PN-cache is only accessed on incidents resulting in an external transaction.
Then the physical PN-cache may be implemented outside the processor chip, for instance in an external L2 cache controller.
Such a solution would save silicon area on the chip and would limit its pin count: the effective physical address does not have to be known in the chip, so only pointers to this address have to transit through the pins.
When an external L2 cache controller is used, a decoupled sectored cache structure [15] might be used. Combining the physical PN-cache and the L2 cache address tag array of the decoupled sectored cache would limit tag cost at all levels.
3.10
Performance evaluation
We conducted two sets of trace-driven simulations in order to validate our approach. In the first set of simulations, we considered a classic split I/D L1 cache organization.
In the second set of simulations, we simulated a DEC-21164-like memory hierarchy. We used the Spa
No modification of the binary to be analyzed was required; user code of a single application is completely traced.
Benchmark selection
A number of benchmarks were tried: the SPEC benchmarks, SPLASH [13], and several others. As expected, applications which do not touch a significant number of pages exhibited roughly the same number of cache misses on an indirect-tagged cache as on a conventional cache. Accordingly, our experiments employ only the 10 benchmarks which manipulated more than 370 4 Kbyte pages during their first 100 million memory references.
These numbers are given in Table 1.
The illustrated set comprises 7 SPEC92 benchmarks (eqntott (eqn), gcc, wave5 (wav), tomcatv (tom), ear, swm256 (swm), su2cor (su2)) and 3 SPLASH benchmarks (barnes (bar), cholesky (cho), pthor (pth)). A multiuser trace was also used; the 5 following benchmarks were used to build this trace: cholesky, tomcatv, net (a personal application), LocusRoute (a SPLASH benchmark) and wave5. A context switch was assumed every 500,000 memory references, and 500 million references were simulated for this workload.
Statistics on these benchmarks are reported in Table 1. They include the global miss ratio for split I/D 8 Kbyte and 32 Kbyte direct-mapped caches (assuming a 16-byte block size) and the miss ratio on a 3-way set-associative 96 Kbyte unified L2 cache (assuming a 32-byte block size).
Performance metrics
Since page misses in the PN-cache generate some cache block invalidations, indirect-tagged caches are likely to exhibit higher miss ratios than conventional ones. But we expect these miss ratios to be in the same range, so we compare the miss ratios using the relative miss ratio defined as:

The block size was fixed at 16 bytes and the page size at 4 Kbytes. Split I/D caches and split I/D TLBs were simulated. A single PN-cache was assumed. Two cache configurations were simulated: 8 Kbyte direct-mapped caches + 32-entry fully-associative TLB, and 32 Kbyte direct-mapped caches + 64-entry fully-associative TLB. For the indirect-tagged caches, the PN-cache sizes considered are 128, 256 and 512 entries.

The PN-cache was assumed to be 4-way skewed-associative [14]. The skewed-associative structure was chosen because it has been shown to distribute the accesses over the whole cache better than a set-associative structure. In particular, the PN-cache replacement policy proposed in 3.5 is more efficient on a skewed-associative PN-cache than on a set-associative PN-cache.

Assuming a 40-bit physical address, the respective storage sizes occupied by address tags in the distinct simulated caches are illustrated in Tables 2 and 3. The simulation results are shown in Tables 4 and 5. First, notice that for all the benchmarks and for the two simulated configurations, using an indirect-tagged cache instead of a conventional cache structure does not change the range of the miss ratios.
As expected, a slightly larger PN-cache is needed when the cache is larger. This can be explained easily: in a large cache, data blocks are likely to remain valid longer than in a smaller cache. On the other hand, using an indirect-tagged cache saves more silicon area on a large cache than on a smaller one (Tables 2 and 3): while a 512-entry PN-cache is clearly too area consuming on an 8 Kbyte cache, it saves around 8 Kbytes of storage when considering a 32 Kbyte cache.
As far as only the cache miss ratio is considered, a 128-entry PN-cache seems to be sufficient. Nevertheless, performance will also be determined by the time spent on invalidating page entries. When there is no valid block from the page in the cache, the invalidation is almost immediate. The penalty paid when there is a valid block will be higher (roughly one cycle per cache block in the page).
For each benchmark, the total number of invalidated pages is represented in Table 6. The total numbers of non-empty invalidated pages are illustrated in Tables 7 and 8. These counts correspond to 100 million references for each benchmark, except for the multiuser workload, which corresponds to 500 million memory references. These results show that, considering a PN-cache with 256 or 512 entries, for all our benchmarks the time spent in servicing the PN-cache misses would be quite low compared with the time spent on servicing cache misses.
Simulations were done first assuming a conventional cache structure, then assuming indirect-tagged caches using a single physical PN-cache in the processor. Considering the 40-bit physical address used in the DEC 21164, Table 9 illustrates the number of storage bits used for storing the physical address tags (or pointers to address tags) in the different configurations.
The area saved when using indirect-tagged caches is quite substantial: about 7 Kbytes when using a 512-entry PN-cache. Table 10 gives the relative miss ratios on the L2 cache for our benchmark set; as for the L1 cache, there is no significant miss ratio difference with the conventional approach.
The major difference with the previous experiment is the page size (8 Kbytes instead of 4 Kbytes).
This results in a significant decrease in page invalidations (Tables 11 and 12). When using a 512-entry PN-cache, the number of non-empty page invalidations is insignificant even on the multiuser workload.
Summary
Our simulation results clearly indicate that, on a microprocessor implementing large on-chip caches (large L1 caches or a cache hierarchy), removing page number representation redundancy would not significantly degrade performance on most applications when the PN-cache is large enough. A lot of space may then be saved on the chip (e.g. see Table 9).
As the number of non-empty page invalidations is quite small, a sequential invalidation of these pages would not significantly affect performance.

Table 11: Page invalidations on a 96 Kbytes L2 cache

             wav  bar  su2  eqn   gcc   pth  ear  tom   swm   cho  multi
128-entry     18  491  123   50  6096  5889   15  533  1488  2735  84095
256-entry      0    1    4    1    37   164    0   18    96   129  13429
512-entry      0    0    0    0     1     2    0    1     0     5     97

Table 12: Non-empty page invalidations on a 96 Kbytes L2 cache

4 Reduced Branch Target Buffer
In a Branch Target Buffer, both the branch address and the target address include a page number. We have shown above how to retrieve the page number for an L1 indirect-tagged cache with a pointer tag to a PN-cache. The same idea may be used for BTBs for representing both the branch target and the address tag.
Assuming that a PN-cache is already implemented in the processor, in a Reduced Branch Target Buffer, a pointer to the matching PN-cache entry of the branch replaces the whole page number in the tag, and a pointer to the matching table entry of the target page replaces the page number in the target address.
Size of Reduced Branch Target Buffers
The storage size required in a Reduced Branch Target Buffer is independent of the address width. The reduction in size of the Branch Target Buffer is dramatic for current parameters as illustrated on the same example of section 2.
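As a cross-check of this size claim, the entry widths of both organizations can be computed from the parameters of the Section 2 example. This is a sketch under stated assumptions: the function names are hypothetical, a 64-entry PN-cache is assumed, and the set index plus alignment bits are assumed to fit inside the page offset.

```python
def log2(x):
    return x.bit_length() - 1  # exact for powers of two


def reduced_btb_entry_bits(pn_entries=64, page_bytes=4096,
                           entries=512, ways=4, instr_bytes=4):
    """Entry width when page numbers are replaced by PN-cache pointers."""
    pointer_bits = log2(pn_entries)                           # 6
    offset_bits = log2(page_bytes)                            # 12
    implicit = log2(entries // ways) + log2(instr_bytes)      # 7 + 2
    tag_bits = (offset_bits - implicit) + pointer_bits        # 3 + 6
    target_bits = (offset_bits - log2(instr_bytes)) + pointer_bits   # 10 + 6
    return tag_bits + target_bits


def conventional_entry_bits(address_bits, entries=512, ways=4, instr_bytes=4):
    """Entry width when full addresses are stored: (N - 9) + (N - 2)."""
    implicit = log2(entries // ways) + log2(instr_bytes)
    return (address_bits - implicit) + (address_bits - log2(instr_bytes))


reduced = reduced_btb_entry_bits()
conventional = conventional_entry_bits(40)
print(reduced, conventional, round(1 - reduced / conventional, 3))
```

Note that `reduced_btb_entry_bits` takes no address width at all: only the PN-cache size and the page size determine the entry width.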
A 4-way set-associative 512-entry Branch Target Buffer (as in the Intel PentiumPro), a 40-bit address, a 64-entry PN-cache, a 4-byte constant instruction length, and a 4 Kbyte page size are considered. In a Reduced Branch Target Buffer each entry consists of 9 tag bits (3 bits for retrieving the page offset and a 6-bit page pointer) plus 16 data bits (i.e., 10 bits for retrieving the page offset plus a 6-bit page pointer). This total of 25 bits is to be compared with the 69 bits of a conventional solution. The total storage size in the Branch Target Buffer is reduced by more than 60%.

4.2 Reduced Branch Target Buffers versus NLS table
As previously mentioned, in an NLS table [1], a pointer to a cache location is stored instead of a memory address. At equal numbers of entries, the size of the NLS table proposed in [1] is lower than the size of our Reduced Branch Target Buffer: assuming a 32 Kbyte instruction cache, each entry in the NLS table is 13 bits wide. This has to be compared with the 25 bits needed in a Reduced Branch Target Buffer for storing the indirect tag and the indirect address in the previous example.
Nevertheless, the Reduced Branch Target Buffer presents the two following advantages over the NLS table:
• In a Reduced Branch Target Buffer, a representation of the complete address is predicted. Then, on a miss, the instruction block may be speculatively loaded immediately. Such aggressive speculative instruction fetches were shown to be effective with a small L2 cache latency [11]. Using an NLS table, a position in the I-cache is predicted. The information "I-cache hit or miss" cannot be computed before knowing the effective address. Then the servicing of the I-cache misses must be delayed until the effective address computation.
• The NLS table is tag-less and then must be direct-mapped. Mapping conflicts may then lead to target mispredictions. The Reduced Branch Target Buffer may be set-associative. Then, at equal entry numbers, a set-associative Reduced Branch Target Buffer will lead to fewer mispredicted addresses than an NLS table.

More generally, on our benchmark set, we remarked that a 256-entry 4-way set-associative Reduced Branch Target Buffer outperforms a 1024-entry NLS table when the instruction miss ratio is significant (more than 1%), while conversely, the lower the instruction miss ratio, the better the NLS table performs. This can be easily explained.
On a branch, when the target instruction has been ejected from the instruction cache by a miss, this leads to a misfetch on the NLS table.
On a Reduced BTB, the complete target address is predicted; the target is still missing from the cache, but the servicing of this miss may be initiated immediately. On the other hand, when the instruction miss ratio is very limited, the large size of the 1024-entry NLS table allows it to reach a hit ratio slightly higher than the hit ratio on a 256-entry 4-way set-associative Reduced BTB.
If the access time to the address prediction mechanism is not a critical issue, a 256-entry 4-way set-associative Reduced Branch Target Buffer should be preferred to a 1024-entry NLS table, because it requires approximately half of the number of storage cells and it allows the use of an aggressive policy of fetches on instruction misses.
Tag-less Reduced Branch Target Buffer
When the access time to the address prediction mechanism is a critical issue, a tag-less Reduced Branch Target Buffer or an NLS table may be used.
At equal number of entries, the NLS table exhibits a higher target misprediction rate than the Reduced Branch Target Buffer.
But such an extra misprediction is only encountered when the target instruction is missing in the cache: if the instruction miss is not serviced before the effective target computation, then such an extra target misprediction will not affect the execution time.
As already mentioned, the choice between the two mechanisms will be motivated by the policy on instruction cache misses. An aggressive instruction miss servicing will require the prediction of a complete representation of the address, thus leading to the choice of a tag-less Reduced Branch Target Buffer.

Decreasing the implementation cost of both cache tags and Branch Target Buffers is then an important issue. Paradoxically, in current microprocessors, the page number associated with a cache block is represented two or three times: first as a part of the address tag associated with the cache block, second as a page address in a TLB entry for virtual-to-physical address translation, and sometimes a third time as a part of the branch address or the target address. It is even more curious that these data are read at the same time and compared during the tag check.
In this paper, we have shown that this redundancy may be removed by applying two simple principles:
1. Store page numbers only once
2. Do not use the page number, but a pointer to it
We have proposed to represent the page number only once in a unique page number cache or PN-cache. The PN-cache may be the TLB (Translation Lookaside Buffer) when virtual tags are used, or an independent cache when physical tags are used.
In the indirect-tagged cache and the Reduced Branch Target Buffer, complete address tags are replaced by shorter indirect tags. Using an indirect-tagged cache or a Reduced Branch Target Buffer, the anachronistic duplication of tags in processors is removed. The tag check is also simplified. The global storage size in the Reduced Branch Target Buffer as well as the tag storage size in the indirect-tagged cache do not depend on the address width and are dramatically lower than the sizes of conventional implementations.
As mentioned above, when indirect-tagged caches and a Reduced Branch Target Buffer are used, the page number of a data block is represented only once in a centralized PN-cache (the TLB for virtually tagged caches, or a physical PN-cache for physically tagged caches). The PN-cache serves as an indirect address tag array for several uses (TLBs, L1 caches, branch and target addresses in the Branch Target Buffer, L2 caches). This PN-cache is not accessed in normal execution (cache hits, TLB hits), but only on incidents resulting in an external transaction (e.g. a cache miss). Therefore the PN-cache does not have to be multiported.
Simulations have shown that the performance of indirect-tagged caches (resp. Reduced Branch Target Buffers) is very close to the performance of conventional designs, at a lower tag implementation cost.
On indirect-tagged caches, when a page is missing in the PN-cache, all the cache blocks associated with the replaced page entry must be invalidated in the caches. Nevertheless, when using a reasonably large PN-cache (e.g. 512 entries), the number of non-empty replaced pages is very small. Then, even when a sequential invalidation of the blocks of a non-empty replaced page is implemented, the global performance loss due to page invalidations is small.
