A common mechanism to perform hardware-based prefetching for regular accesses to arrays and chained lists is based on a Load/Store cache (LSC). An LSC associates the address of a ld/st instruction with its individual behavior at every entry. We show that the implementation cost of the LSC is rather high, and that using it is inefficient. We aim to decrease the cost of the LSC but not its performance. This may be done by preventing useless instructions from being stored in the LSC. We propose eliminating those instructions that never miss, and those that follow a sequential pattern.
INTRODUCTION
Every effort to decrease cycle time or increase Instruction Level Parallelism when designing a high performance microprocessor may be neutralized by a slow memory subsystem [2, 3]. A large number of techniques have been developed to minimize the latency impact of a data reference, reducing either the number of misses (e.g. hardware and software based prefetching) or their cost (e.g. non-blocking caches, multithreading, decoupling).
Software-based data prefetching techniques use special PREFETCH instructions to bring a block into the cache memory in advance. Most contemporary architectures include this instruction, and many compilers can transform simple loops in order to decrease their miss ratio substantially. However, these software techniques sometimes perform poorly with respect to hardware prefetching [4, 23].
Moreover, hardware-based prefetching does not require recompilation and does not increase code size. A typical approach consists of prefetching one or more consecutive blocks in a sequential way [21], and it is frequently used on the instruction stream. In recent years (91-97), alternative approaches have been proposed to predict non-sequential accesses or to issue prefetches at the right time.
In this paper we show a way of improving hardware-based prefetching in regular accesses to arrays and chained lists. In particular, we focus on the Load/Store Cache (LSC), a mechanism driven by the address stream generated by each ld/st instruction. The LSC associates the PC-address of a ld/st instruction with its individual behavior. Every time a ld/st instruction that is not present in the LSC is executed (LSC miss), it is inserted into the LSC. Once a pattern has been recognized and the instruction executes again, a prefetch address is computed and issued.
Previous work proposes a direct-mapped LSC with a number of entries between 128 and 512 [1, 10, 16, 17]. It will be shown later that the cost of an LSC with 512 entries is similar to that of a double-ported cache memory with a size between 6 and 12 KB. This may discourage the use of the LSC in favor of more important resources.
Our purpose is to improve the use of the LSC entries in order to decrease their number. We want to achieve similar (or even better) performance while reducing the cost.
As far as we know, no previous reference considers both the LSC performance [10, 16] and the behavior of the ld/st instructions regarding their addressing patterns [8, 17], and the workloads they consider are not very extensive. Our first contribution goes in that direction, and is based on the analysis of 25 programs (taken from SPEC92, SPEC95 and Perfect Club). From this characterization we can advance the following results: a) the miss ratio for a direct-mapped LSC with 512 entries is too high for many applications, b) most of the instructions follow scalar or sequential patterns, and c) most of the ld/st instructions hardly miss.
In this paper we propose preventing some instructions from being stored in the LSC in order to reduce its number of entries. Considering the former results, we identify and discard instructions that: a) induce the prefetch of blocks that are already in the cache, or b) show a sequential addressing pattern. This can be done by combining two strategies:
-Storing ld/st instructions in the LSC only when they miss in the data cache (on-miss insertion)
-Performing sequential prefetching in parallel. In particular, we use One-Block-Lookahead sequential tagged prefetching (OBLst) [21].
To understand the key aspects of prefetching, we use a performance model that takes into account the cache lookup pressure, the data CPI, the shift from CPU misses to correctly prefetched misses, and other useful characterizations of the prefetching dynamics. Through this model, we compare our approach with the conventional LSC by means of a detailed cycle-level simulation, varying parameters such as LSC and data cache sizes. A noticeable point is that performance holds when the number of LSC entries is reduced: we show that an LSC of only 8 entries, managed by on-miss insertion and combined with sequential prefetching, performs better than a conventional LSC of 512 entries.
Footnote (partially legible): grants TIC98-0511-C02 and TIC% I 127.
This paper is organized as follows. Section 2 reviews related work. Section 3 presents the workload, the methodology we have followed and the characterization of the ld/st stream behavior relevant to the LSC operation. Section 4 introduces the on-miss insertion policy and its combination with sequential data prefetching in detail. Sections 5 and 6 describe the performance model, give experimental results and discuss implementation costs. Finally, Section 7 summarizes the main points of this contribution.
PREVIOUS WORK
Non-sequential data prefetching was first introduced by Baer and Chen [1] under the term Preloading. The LSC used in that work was called Reference Prediction Table (RPT). An RPT is organized as a cache indexed by a Look Ahead Program Counter (LA-PC), whose value is based on some branch prediction policy. LA-PC varies from the current PC value to a limit imposed by a constant named prefetch distance, which is proportional to the latency of the following level. If LA-PC hits in the RPT, and the state of the selected entry indicates a stride pattern, the sum of the latest address issued by the corresponding ld/st and the stride is prefetched. Prefetched blocks are directly added to the data cache.
Stride-directed prefetching [10] and speculative prefetching [16] use a similar table, but now indexed by the PC. Every time a ld/st instruction is executed, the table is searched. If the instruction is found in the table, the corresponding prefetched block will be used in the next iteration. Under stride-directed prefetching, target blocks are loaded directly into the data cache. Under speculative prefetching, the prefetched blocks are loaded into a small separate cache.
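As an illustration of the PC-indexed tables described above, the following is a minimal sketch of stride detection, not the design of any of the cited papers. The class names, the two-state "steady" flag (a simplification of a multi-state automaton) and the 4-byte instruction alignment used for indexing are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pc: int               # tag: address of the ld/st instruction
    last_addr: int        # last data address issued by that instruction
    stride: int = 0       # last observed address difference
    steady: bool = False  # True once the same non-zero stride repeats


class StrideTable:
    """Direct-mapped, PC-indexed table (hypothetical simplification of an LSC)."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = [None] * entries

    def access(self, pc, addr):
        """Record one ld/st execution; return a prefetch address or None."""
        idx = (pc >> 2) % self.entries      # assumes 4-byte instructions
        e = self.table[idx]
        if e is None or e.pc != pc:         # table miss: allocate a fresh entry
            self.table[idx] = Entry(pc, addr)
            return None
        stride = addr - e.last_addr
        e.steady = (stride == e.stride and stride != 0)
        e.stride, e.last_addr = stride, addr
        # Prefetch address = latest address plus stride, once the stride is stable.
        return addr + stride if e.steady else None
```

With this simplification, the first prefetch is issued on the third execution of an instruction after it enters the table, and a conflicting instruction mapped to the same entry restarts the detection.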
Finally, [17] suggests an addition to the previous approaches in order to detect a linear traversal of a chained list made up of records. It is assumed that each record has a next-address field that points to the next record. This mechanism detects the ld instruction that reads the next-address field, and uses that value plus/minus a constant as the prefetch address. This paper also considers the pattern that appears when traversing sparse arrays whose non-zero elements are chained by an index (e.g. spice).
Besides these papers about prefetching based on the classification of ld/st instructions, another large group, which deals directly with the global sequence of addresses (or first-level cache misses), can be considered. The approach presented in [11] and used in [8] is based on keeping a list of common strides by calculating the strides between each reference and the previous sixteen references.
In [20] the minimum delta scheme (also used in [9]) and the partition scheme are introduced. The former calculates the stride as the minimum difference between a missed address and the last n missed addresses. The latter splits memory into zones, and calculates the stride between the last two references to the same zone. In [14] Markov chains are used to prefetch multiple reference predictions from the memory subsystem. These schemes are proposed for off-chip environments, in which the address of the instruction to be included in the LSC is not available.
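The minimum delta scheme mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: "minimum difference" is taken to mean the difference of smallest magnitude, the history length is an arbitrary parameter, and the class and method names are hypothetical.

```python
from collections import deque


class MinimumDelta:
    """Predicts a stride from the last n missed addresses (illustrative sketch)."""

    def __init__(self, history=4):
        self.misses = deque(maxlen=history)  # last n missed addresses

    def on_miss(self, addr):
        """Return the predicted stride for this miss, or None without history."""
        deltas = [addr - old for old in self.misses if addr != old]
        self.misses.append(addr)
        if not deltas:
            return None
        return min(deltas, key=abs)          # smallest-magnitude difference
```

Because every recent miss is a candidate, two interleaved streams with the same stride can still be followed without knowing which instruction produced each miss.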
Finally, some approaches try to issue the prefetch as soon as possible, or even to determine the optimal time to do it, supposing very large latencies (e.g. to fill off-chip caches in advance in shared memory multiprocessors). Thus, [15] proposes the use of stream buffers, which issue several requests in sequence instead of issuing a single one, and store the requests until they are referenced. The purpose of this policy is to increase the prefetch distance, i.e. to request blocks even earlier. In [20, 9] this idea is also applied to streams with stride accesses. Adaptive sequential prefetching is proposed in [6, 7]. In [8] the same idea is extended to non-sequential stride prefetching, and a comparison between the sequential and non-sequential cases is carried out. Adaptive prefetching modifies the prefetch distance dynamically, according to the latency of the system and to the loop size. Prefetched blocks are stored in the cache.
There are many papers dealing with software-based prefetching which we do not mention here, for they lie beyond our scope. We should mention [19], however, because it concludes that most misses are caused by ld instructions with stride and list patterns, and this is a key idea in our work, as we have explained in Section 1. However, it must be pointed out that [19]
C1. The instruction must remain some time in the LSC before being useful, because we need several executions in order to fix the prefetch condition.
C2. The instruction must access memory following one of the regular patterns that can be recognized.
C3. The datum that is being accessed by the instruction must not reside in the data cache, since prefetching is not necessary in that case.
Previous work based on the use of an LSC [1, 10, 14, 17] does not consider the optimization of the LSC in these terms, and gives no detailed characterization of the relative importance of the different patterns recognized.
Workload and tracing methodology
The chosen workload is a set of 25 programs taken from SPEC95 (8 integer and 10 floating point), SPEC92 (spice) and from Perfect Club (6 floating point). This workload has been targeted to a SPARC V7 architecture. The user-mode execution traces have been obtained dynamically by means of Shade, a utility from Sun Microsystems [5]. Figure 1 shows the complete evolution of CPI for a single issue SPARC processor executing Spice on a sample system. A transient behavior up to 4,000 million instructions can be observed. From this point on, a steady state appears with low variation in the mean value. The majority of the papers referred to in Section 2 assume a transient state of some tens of million instructions at most, and take a single observation after this point. This may not be representative of the whole program behavior. Such a transient phase typically corresponds to the initialization of data structures using very simple, almost sequential, access patterns.
Table 1 shows the following data (from left to right): program input, number of instructions, total number of loads and stores of the full execution, number of instructions of the transient phase and number of instructions skipped until statistics begin to be computed. All the numbers are in billions. Programs labeled as irregular do not have a clear transient phase. To compute the means, we consider the three groups that are differentiated in the table (we include Spice from SPEC92 inside the SPEC95-fp group).
LSC misses
We define the miss ratio in the LSC as:

$$m_{LSC} = \frac{\text{number of LSC misses}}{N}$$

where N is the number of memory references, and a miss is counted every time the PC-address of a ld/st instruction is not found in the LSC directory. This miss ratio is related to condition C1, because the average number of times that a ld/st instruction is executed before being replaced is just the inverse of mLSC. Table 2 presents the average values of mLSC for a direct-mapped LSC with a different number of entries. If we analyze the individual behavior of each program, it can be observed that in order to achieve a value of mLSC less than or equal to 10%, we need at least 1024 entries in 10 out of 25 benchmarks. With mLSC = 10%, a ld/st instruction remains in the LSC for ten consecutive instances on average. Whenever the instruction is replaced, 3 executions are needed to detect a pattern before triggering the prefetch. Therefore, the ratio (#prefetches / #references) is 70%. With a miss ratio of 25%, that ratio drops to 25%.

Table 2: mLSC, average miss ratio of a direct-mapped LSC in %

Even though we do not know the relation between mLSC and the reduction of the effective access time due to prefetching, those data suggest a higher number of entries than that used in previous papers [1, 10, 16, 17].
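The prefetch-to-reference ratios quoted for condition C1 can be checked with a short derivation. It is a sketch that assumes, as the text does, that an instruction stays in the LSC for 1/mLSC consecutive executions and that the first 3 executions after each insertion issue no prefetch:

```latex
\frac{\#\text{prefetches}}{\#\text{references}}
  = \frac{1/m_{LSC} - 3}{1/m_{LSC}}
  = 1 - 3\,m_{LSC}
\qquad\Rightarrow\qquad
\begin{cases}
  m_{LSC} = 0.10: & 1 - 0.30 = 0.70 \\
  m_{LSC} = 0.25: & 1 - 0.75 = 0.25
\end{cases}
```

Under the same assumptions the ratio falls to zero once mLSC reaches one third, since an instruction is then replaced before its pattern is confirmed.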
Pattern Distribution
Even supposing an ideal behavior of the LSC (mLSC = 0), some pattern is needed to trigger prefetching (condition C2). We have looked in our workload for five patterns that can be recognized by hardware techniques. This has been done by tracking a set of equalities for each ld/st instruction, in which i denotes an execution of the instruction, Di is the value read by a load instruction during its execution i, and Bsize is the block size. Stores can only follow the first three patterns.
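To make the idea of pattern tracking concrete, here is a small classifier over the address trace of one instruction. It is a sketch only: the tests below are assumed, simplified definitions (scalar = same address, sequential = same or next block, stride = constant difference, pointer list = address equal to the previously loaded value plus a constant) and are not the equalities used in this paper. The index-list pattern is left out, and the function name is hypothetical.

```python
def classify(addrs, values=None, block_size=32):
    """Label one instruction's trace as "SCA", "SEQ", "STR", "PTR" or None."""
    if len(addrs) < 3:
        return None                       # too few executions to decide
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    if all(d == 0 for d in deltas):
        return "SCA"                      # same address on every execution
    blocks = [a // block_size for a in addrs]
    if all(b2 - b1 in (0, 1) for b1, b2 in zip(blocks, blocks[1:])):
        return "SEQ"                      # stays in a block or moves to the next
    if len(set(deltas)) == 1:
        return "STR"                      # constant stride
    if values is not None:                # loads only: stores read no data
        # Address of execution i+1 versus value loaded at execution i.
        offsets = {a - v for a, v in zip(addrs[1:], values)}
        if len(offsets) == 1:
            return "PTR"                  # next address = loaded value + constant
    return None
```

Checking the patterns in this order means that a small constant stride is reported as sequential rather than as a stride.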
The groups are presented in Table 3. Most of the accesses are scalar or sequential. Few of them follow stride or chained list patterns, and they concentrate in a few benchmarks. The higher percentages in the Remaining column are mainly due to integer programs whose access patterns we do not detect.
Footnote: An example of the IND pattern is accessing a non-compressed sparse array by means of another array holding the indexes of the nonzero elements; "INDex list" models the reference to the index array. The program spice2g6 shows a large percentage of this behavior.
A lot of research on the matter reports a high number of stride accesses, because sequential accesses are considered as a particular case of stride accesses. An exception is [8], where sequential and stride patterns are studied separately in programs of the SPLASH-1 suite, in a multiprocessor environment. The distributions given in that paper are very similar to those we have found in SPEC95 and Perfect Club.
The frequent SEQuential pattern, when considered as a particular case of the STRide pattern, ensures the utility of any LSC prepared for detecting strides. However, sequential prefetching is simpler (no LSC is required) and cheaper (at most one bit per block).
On the other hand, prefetching SCAlar patterns is useless when LSC is indexed through the PC, because a variable is accessed and prefetched at the same time. However, this may be useful when indexing the LSC through the LA-PC.
The sum of stride and list patterns (STR, PTR and IND columns) is small but noticeable: the average values for Perfect, SPEC95-int and SPEC95-fp are 10.8%, 4.6% and 8.35% respectively. Nevertheless, in some programs most of the accesses follow these patterns (ARC2D, TRFD, applu, spice2g6) and a high benefit from prefetching may be obtained.
Execution and miss frequencies correlation.
Whatever patterns they follow, it is of no use to keep ld/st instructions that never miss in the LSC (condition C3).
Up to now we have considered that a ld/st instruction is inserted in the LSC when it is executed and misses in the LSC. We call this strategy always insertion. Under always insertion and in the absence of conflicts, the probability of a ld/st instruction being in the LSC is proportional to its frequency of execution. However, the set of ld/st instructions we are keeping in the LSC (the most frequently executed) may not be the most suitable set (the set that would cause more cache misses if prefetching were turned off).
The ld/st instructions we want to store in the LSC belong to classes C and D, yet those which are really filling the table belong to classes B and D. It can be observed that B∪D is greater than C∪D by a factor of 5, 2 and 5.4 for Perfect, SPEC95-int and SPEC95-fp respectively.
Instructions belonging to class C have a low probability of being in the LSC. However, they yield many misses and, in consequence, it would be convenient to keep them in the LSC. On average, they represent 25% of the set with more misses, C∪D.
On the other hand, ld/st instructions of class B have a high probability of being in the LSC. However, they hardly miss in the cache and it is of no use to keep them in the LSC. On average, they account for 78.6% of the most executed set, B∪D.
ON-MISS INSERTION PLUS SEQUENTIAL-TAGGED PREFETCHING
From the pattern distribution and the correlation between execution and miss frequencies, we propose a combined prefetching strategy, in which addresses are computed by two independent prefetching mechanisms working in parallel: a) LSC with on-miss insertion (LSCmi) prefetching, and b) One Block Lookahead sequential tagged (OBLst) prefetching [21] .
Under on-miss insertion, a hit in the LSC involves the same actions as under always insertion (updating the state, the data field, etc.). But in case of an LSC miss, insertion is performed only when there has been a miss in the data cache too.
This way, the probability of finding a ld/st instruction in the LSC is proportional to its miss frequency in the data cache, and not to its frequency of execution. If two or more ld/st instructions are mapped to the same LSC entry, they do not contend for that entry if they hit in the data cache, which increases the stability of those instructions that miss.
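The difference between the two insertion policies can be illustrated with a toy directory that stores only PC tags. This is a sketch under assumptions: the class and method names are hypothetical, pattern state is omitted, and indexing assumes 4-byte instructions.

```python
class LSCDirectory:
    """Direct-mapped LSC directory holding only PC tags (hypothetical sketch)."""

    def __init__(self, entries, on_miss_insertion):
        self.tags = [None] * entries
        self.on_miss_insertion = on_miss_insertion
        self.replacements = 0

    def execute(self, pc, data_cache_miss):
        """Process one ld/st execution; return True on an LSC hit."""
        idx = (pc >> 2) % len(self.tags)    # assumes 4-byte instructions
        if self.tags[idx] == pc:
            return True                     # LSC hit: same actions in both policies
        # LSC miss: always insertion allocates unconditionally,
        # on-miss insertion only if the data cache missed too.
        if not self.on_miss_insertion or data_cache_miss:
            if self.tags[idx] is not None:
                self.replacements += 1
            self.tags[idx] = pc
        return False


def run(on_miss_insertion, iterations=10):
    """Two instructions share one entry; only the first misses in the data cache."""
    lsc = LSCDirectory(8, on_miss_insertion)
    hits = 0
    for _ in range(iterations):
        hits += lsc.execute(0x100, data_cache_miss=True)
        hits += lsc.execute(0x120, data_cache_miss=False)
    return hits, lsc.replacements
```

In this toy trace, always insertion makes the two instructions evict each other on every execution, whereas on-miss insertion lets the missing instruction keep its entry.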
We add OBLst prefetching to prevent the ld/st instructions which follow a SEQuential pattern from contending for LSC entries. Moreover, sequential prefetching can determine sequential relations among different loads, adding useful prefetching streams which an LSC would not be able to generate.
By way of example, let us consider a chained list whose records are larger than the block size. The ld instruction which reads the pointer field is classified by the corresponding detector. However, ld instructions which access the remaining fields do not follow a regular pattern; OBLst prefetching can perform successfully in this case.
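One-block-lookahead tagged prefetching can be sketched as follows: the next block is requested on a demand miss and also on the first reference to a block that was brought in by a prefetch. The model below is illustrative only; it assumes an unbounded cache with no replacement, and the class and method names are hypothetical.

```python
class TaggedSequentialCache:
    """Unbounded block store with one tag bit per block (illustrative sketch)."""

    def __init__(self, block_size=32):
        self.block_size = block_size
        self.tag = {}   # block number -> True if prefetched and not yet referenced

    def reference(self, addr):
        """Return the list of block numbers prefetched because of this reference."""
        block = addr // self.block_size
        demand_miss = block not in self.tag
        first_use = self.tag.get(block, False)
        self.tag[block] = False             # block is now present and referenced
        if (demand_miss or first_use) and (block + 1) not in self.tag:
            self.tag[block + 1] = True      # bring the next block, mark it prefetched
            return [block + 1]
        return []
```

The single tag bit per block is what keeps a sequential stream running ahead of the references without any per-instruction table.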
In Table 5 we show the replacement ratio for an LSCmi, defined as the number of ld/st entries evicted divided by the total number of references. This ratio is comparable with that of Table 2 in the sense that its inverse is the average number of times that a given instruction is executed while remaining in the LSC until it is replaced. The shift of one to two orders of magnitude between the two tables is noticeable.
Table 5: Average replacement ratio of a direct-mapped LSCmi.
PERFORMANCE ANALYSIS FOR A SINGLE PROCESSOR-MEMORY SYSTEM

Workload and tracing methodology
The evaluation has been carried out by using the workload described in subsection 3.1. However, we now assume a multiprogramming environment with a quantum of 1 million instructions. We also assume a multiprogramming degree high enough to empty the first-level caches completely between every two bursts of the same process.
Given the high time cost of the cycle-level simulation and the large number of benchmarks that have been analyzed, it is not possible to proceed with the same number of instructions that was used in Section 3. A limit of 20 million instructions has been fixed. However, to improve the representativeness of sampling, we scatter the 20 quanta (observations) over a certain interval. For each application this interval starts at the end of the transient phase we showed in Table 1, and its size varies according to the benchmark behavior (e.g. a lot of benchmarks follow some periodical behavior; the size of their intervals matches the size of their periods). Similar studies confirm that this kind of sampling is much better than the contiguous selection of the observations [13].
System model
Figure 2 shows the system modeled. The processor is a single issue in-order SPARC, similar to that used in related papers [8, 12]. We consider a level one (L1) on-chip split cache memory, and a level two (L2) off-chip unified cache. Block size is 32B in both cases. In all experiments, we fix a 32KB direct-mapped L1 instruction cache with OBLst prefetching. L2 is ideal in the sense that it always hits, and has a pipelined interface for L1 block requests of 1:7:2 cycles for address transfer, access and data return, respectively.
The L1 data cache (L1dC) is also direct-mapped, but its size and prefetching capabilities are varied (sizes: 8 KB, 32 KB and 128 KB; prefetching: OBLst and/or LSC, with LSC indexed by PC or LA-PC). The LSC detects PoinTeR list, INDex list, STRide and SEQuential patterns by using the policy explained in subsection 3.3, which is similar to the one used in [17]. Demand fetches and prefetches (low priority) have to contend for a single cache port. Therefore, a high prefetching lookup pressure may degrade the system performance. When sequential and LSC prefetching work in parallel it is possible to issue up to two prefetches per reference, which are temporarily held in the lookup buffer.
As in other papers, a 16-entry victim cache is added [22, 14]. This way, we focus on the benefits prefetching offers for the elimination of capacity and compulsory misses, and not on its ability for dealing with conflict misses. Some benchmarks experience a large fraction of conflict misses, in particular applu, apsi, su2cor, swim, tomcatv and wave5 from the SPEC95 suite.
ORL (Outstanding Request List) is an address buffer which supports pending prefetches and gives information about the blocks currently being read in L2.
Performance model
Global measures such as CPI reflect global effects, but they do not capture critical aspects of prefetching. To isolate them, we suggest a model (Figure 3) which is partially based on [22] and which considers the following quantities:
From these quantities we define: a) Number of prefetches per reference ($P/N$); b) Full-latency miss ratio ($m_d = A/N$); c) Partial-latency miss ratio ($m_r = C/N$); d) Conventional miss ratio ($m = (A+C)/N = m_d + m_r$); and e) Prefetch miss ratio ($m_p = H/N$), which can be further split into useless and useful prefetch miss ratios ($m_{pl} = D/N$ and $m_{pu} = U/N$). Note that $m_p + m_d$ is the L2 access ratio, that is, the number of accesses to L2 per reference; this is a measure of the pressure put on L2 by L1dC and the prefetching mechanism together.
The generic goal of prefetching is to decrease the conventional miss ratio ($m$, and especially the fraction $m_d$) with a minimum pressure over L1 (minimum $P/N$) and a minimum L2 access ratio increment (minimum $m_p$).
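The ratios of the model can be computed from raw event counts as in the sketch below. The symbol names (N references, P prefetch lookups, A full-latency misses, C partial-latency misses, U useful and D useless prefetch misses) are a reading of the definitions above and should be taken as assumptions, as should the function name and the relation H = U + D.

```python
def prefetch_ratios(N, P, A, C, U, D):
    """Per-reference ratios of the prefetching model (illustrative sketch).

    N: memory references, P: prefetch lookups, A: full-latency misses,
    C: partial-latency misses, U / D: useful / useless prefetch misses.
    """
    H = U + D   # assumption: every prefetch miss is either useful or useless
    return {
        "lookup_pressure": P / N,           # P/N
        "m_d": A / N,                       # full-latency miss ratio
        "m_r": C / N,                       # partial-latency miss ratio
        "m": (A + C) / N,                   # conventional miss ratio
        "m_p": H / N,                       # prefetch miss ratio
        "l2_access_ratio": (H + A) / N,     # m_p + m_d
        "accuracy": U / H if H else 0.0,    # U / (U + D)
    }
```

For example, `prefetch_ratios(N=1000, P=600, A=20, C=10, U=77, D=23)` describes a run in which 3% of the references miss conventionally while L2 sees 12 accesses per 100 references.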
These three aspects must be balanced for each particular system. Thus, obtaining a minimum m can depend on the miss penalty at the following memory level, whereas obtaining a minimum P/N can be essential depending on the number of ports in the cache. Finally, obtaining a minimum mp can be critical if the L1/L2 bandwidth is limited.
In each figure, the average values for the SPEC95-fp and SPEC95-int groups are displayed separately. The behavior of the Perfect Club group is fairly similar to that of SPEC95-fp. We will omit further references to this workload due to space limitations.
Observing the three figures as a whole, the difference between floating point and integer applications is noticeable. SPEC95-int is almost insensitive to the kind of prefetching, and the reduction in m (conventional miss ratio) due to prefetching is very low, varying from 7.1% (OBLst prefetching) to 19.2% (OBLst + LSCmi-512entr.). This result can be better understood if we consider the characterization given in subsection 3.3, since scalar and irregular patterns prevail here. However, SPEC95-fp is quite sensitive to the kind of prefetching, and m decreases between 27.5% (OBLst prefetching) and 86.3% (OBLst + LSCmi-512entr.).
Another global observation deals with the poor accuracy of OBLst prefetching in the integer workload. Defining accuracy as U/(U+D), we can see the great waste experienced in SPEC95-int: 0.23. In SPEC95-fp the OBLst accuracy rises to 0.77.
It can be observed in Figure 5 that the floating point workload benefits from sequential prefetching, and it is quite sensitive to the size of the LSCconv. Only with a big LSC (512, 128 entries) do we obtain a better data CPI than with OBLst prefetching. The main reduction in m appears in LSCconv-512entr prefetching, which eliminates 81% of misses, whereas OBLst prefetching removes 67.4% of misses.
The lookup pressure put by LSCconv on L1dC (i.e. P/N) is high. For LSCconv-512entr, P/N reaches 61.2%, i.e. the number of accesses to L1dC is multiplied by 1.61. On the other hand, OBLst only loads L1dC with 7.3% extra lookup activity. Let us now analyze separately the performance of LSCmi prefetching (Figure 6).
In SPEC95-fp, on-miss insertion significantly decreases the lookup pressure (P/N). With an LSCmi of 128-512 entries, 35% of the prefetches generated by LSCconv prefetching are removed. With fewer entries in the LSC, the lookup pressure increases with respect to LSCconv, because in that case LSCconv hardly issues prefetches.
On the other hand, the number of prefetch misses forwarded to L2 (H) increases with respect to LSCconv prefetching (0%, 4%, 20%, 52% and 82% for 512, 128, 32, 16 and 8 entries respectively). In the presence of a regular pattern, those prefetch misses are useful, and m(LSCmi) < m(LSCconv). Only with a 512-entry LSC does m(LSCmi) increase, by 0.8%. In the rest of the cases it always decreases: by 9.8%, 27.8%, 37.1% and 28.8% for an LSC with 128, 32, 16 and 8 entries.
With regard to performance, LSCmi prefetching reduces the data CPI of a system without prefetching by between 43% (8 entries) and 68% (512 entries). LSCconv prefetching always behaves worse: LSCmi reduces the data CPI of a system with LSCconv by between 3.6% (512 entries) and 30.5% (16 entries).
In SPEC95-int we can notice the same tendency, although the differences between LSCconv and LSCmi prefetching are smaller. Moreover, the reduction in data CPI relative to a system without prefetching is smaller (1% to 14%).
On the whole, the number of prefetch misses (H) increases largely with respect to an LSCconv system: from 34% (512 entr.) to 26% (8 entr.). The miss ratio decrease is also noticeable: between 21% (512 entr.) and 70% (8 entr.). Data CPI decreases are between 2.4% (512 entr.) and 56% (8 entr.).
In SPEC95-int, results are qualitatively but not quantitatively similar. The accuracy decrease is noticeable due to the concurrent activity of OBLst prefetching.
On the whole, the most relevant fact is the very low sensitivity of the miss ratio and the data CPI to the size of the LSC. When moving from 512 to 8 entries, the loss in performance is only 3.75% in data CPI for SPEC95-fp.
As Table 6 shows for SPEC95-fp, LSCmi-8entr + OBLst prefetching achieves better ratios in almost all the metrics with respect to LSCconv-512entr. It reduces the lookup pressure (P/N decreases S&4%), increases the prefetch miss ratio (28.7% more in H/N), and so decreases the miss ratio (10.3%). As a negative effect, a loss of accuracy, U/(U+D), appears due to the use of OBLst. In both cases data CPI is almost the same, but with a cost 64 times lower.
Table 6: SPEC95-fp comparison between LSCconv-512entr and LSCmi-8entr + OBLst.
Results on programs which follow regular patterns
To observe the performance of our proposal when applied to programs with an outstanding presence of stride and list patterns, we have carried out a selection over the whole workload. Figure 8 shows the means for applu, apsi and wave5 from SPEC95-fp, spice from SPEC92, and arc2d and trfd from Perfect Club. Results for LSCconv appear on the left, and LSCmi + OBLst on the right.
It can be observed in Table 7
Table 7: LSCconv-512entr vs. LSCmi-8entr + OBLst for applu, apsi, wave5, spice (SPEC), arc2d and trfd (Perfect).
Although the results discussed here are based on a 32KB L1dC, we have simulated other cache sizes too (8KB and 128KB). The same conclusions are valid for these sizes, yet the advantages of our method increase with the cache size. Finally, we have issued prefetches by using an LA-PC as in [1], with a prefetch distance equal to 1.5 times the memory latency. In this case the lookup pressure (P/N) increases, because the method tries to prefetch SCAlars too. Therefore, the advantage of our method is greater, since most of these scalar ld/st instructions hit in L1dC and are not inserted in the LSCmi.
COST ANALYSIS
Executing a ld/st instruction requires at least two accesses to the LSC. The first access reads information about that instruction which will be used later for detecting the pattern and for calculating the new state. The second access writes the updated information after the instruction has been executed. With a simple pipeline, the reading could be performed during the ALU stage, after the ld/st instruction has been decoded. The computations for pattern recognition and for generating the new state could be carried out during memory access. Writing should be done in the final stage.
Such a simple implementation requires a write port and a read port in the LSC. If prefetches are driven by LA-PC, a third port (an additional read port) is needed in order to check whether the instruction addressed by LA-PC is a ld/st, and whether it matches some pattern.
Every entry in the LSC contains a variable number of fields, depending on the patterns that we want to recognize. If we intend to detect strides only, four fields are required: PC, Ai, Si and state.
32-bit addressing yields 12 bytes per entry (state needs only a few bits). If we intend to detect accesses to lists chained by address and index, we should add three more fields (Di, di and Ki), and every entry would take 24 bytes.
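The storage figures that follow from these entry sizes can be reproduced with a short calculation. The helper below is a sketch: its name and parameters are hypothetical, it counts only the 32-bit fields, and it ignores the few state bits per entry as the text does.

```python
def lsc_storage_bytes(entries, detect_lists, field_bytes=4):
    """Data storage of an LSC, ignoring the few state bits per entry."""
    # Three 32-bit fields for strides; three more when lists are also detected.
    fields = 6 if detect_lists else 3
    return entries * fields * field_bytes


strides_only_512 = lsc_storage_bytes(512, detect_lists=False)   # 6144 B = 6 KB
with_lists_512 = lsc_storage_bytes(512, detect_lists=True)      # 12288 B = 12 KB
with_lists_8 = lsc_storage_bytes(8, detect_lists=True)          # 192 B
storage_ratio = with_lists_512 // with_lists_8                  # 64
```

Going from 512 to 8 entries of the same width therefore divides the data storage by 64, independently of how many fields each entry holds.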
To sum up, a direct-mapped LSC with 512 entries stores between 6KB and 12KB, uses a decoder with 512 entries (similar to that of a direct-mapped cache of 8KB with blocks of 16B), and requires at least two ports, since it must be tested and updated in a single cycle. Therefore, an LSC is comparable in size to a first-level data cache. Replacing such an expensive 512-entry LSCconv by an 8-entry LSCmi + OBLst divides its storage cost by a factor of 64. It is difficult to apply an area model in order to take into account the fixed cost of the control unit, due to the great inaccuracy of such models when computing area for very small caches [18].
CONCLUSIONS
In this paper we have analyzed the performance of a load/store cache as a base for different proposals of hardware-based data prefetching with patterns other than the sequential one. We have found that, in order to perform better than with sequential prefetching, it is necessary to provide as much storage area as for a first-level data cache. This is due to two key facts: a) regular patterns different from the sequential pattern are uncommon; and b) most of the instructions that occupy the LSC entries do not miss (i.e. the prefetching involved is useless).
Decreasing the cost of the LSC with no efficiency loss implies that useless instructions must be removed from the LSC. To do that, we propose applying on-miss insertion in the LSC (LSCmi), working in parallel with tagged sequential prefetching.
On-miss insertion introduces new instructions in the LSC only if they miss in the data cache. This way, instructions that can profit from prefetching are more likely to be included in the LSC. On the other hand, sequential prefetching reinforces this point because it prevents the ld/st instructions which follow this pattern from contending for the LSC entries.
For numerical workloads (SPEC95-fp and Perfect Club) our proposal achieves a great increase in performance for every cache size, especially for the small ones. We believe that the relevant point here is that the performance of an LSCmi decreases by only 3.75% in terms of data CPI when the number of entries decreases from 512 to 8. An LSCmi with 8 entries working in combination with OBLst prefetching achieves a performance comparable to that of a conventional LSC with 512 entries (with a storage cost 64 times lower).
For non-numerical workloads (SPEC95-int) the performance of an LSC is rather limited because of the absence of recognizable regular patterns. In this context, it makes little sense to improve its management. Nevertheless, since our method strongly reduces the number of entries, it is possible to increase the number of fields of each entry (for detecting new patterns) at little cost.
