Abstract
Introduction
Techniques exist to hide or tolerate the latency of loads and stores. Executing instructions in an out-of-order manner helps hide latency. Compilers can assist by performing loads well before the value is required by other instructions. Prefetching data and instructions into the caches is an approach implemented in both hardware and software to reduce high-latency off-chip memory accesses. Despite these and similar techniques, memory access latency is still considered one of the primary bottlenecks in processors today.
Almost all modern processors allow dynamic ordering of load and store instructions. Non-blocking caches and buffering structures (such as write buffers, store buffers, store queues, and load queues) are typically employed. We find that the manner in which stores are handled within the store buffer can impact the performance of processors. In the absence of adequate literature discussing this impact, we examine several store buffer issues, including size, store removal, store retirement point, store priority switching, and virtual store buffers. A thorough study of these policies shows a potential for increased load forwarding and decreased load latency with changes in policy.
The challenge is to achieve this performance win without increasing store buffer related stalls (i.e., stalling the pipeline since no new stores can be issued) or other detrimental effects. We find that the lazy store removal scheme can have a large impact on processor performance. In addition, we see that smaller, well-designed store buffers can achieve performance comparable to larger, basic store buffers.
The paper is divided into the following sections. Section 2 provides more background on store buffers and the issues we are addressing. Section 3 discusses the store buffer policies and parameters that we study. Section 4 discusses our methodology and benchmarks. Section 5 discusses the results from different store buffer configurations. Section 6 offers a summary and conclusions.
Background

Related Work
Johnson provides an in-depth discussion of Figure 1 while most of the write buffer literature assumes a structure similar to the one in Figure 2. These write buffers are accessed in parallel with the on-chip cache and have the ability to combine several stores with contiguous addresses or the same address. Skadron and Clark discuss the issues and tradeoffs involved in such a write buffer [23]. Martonosi 
Current Implementations
Commercial processors have been implementing store buffers or similar ideas for many years. Although in-depth analyses of performance tradeoffs are not readily available, there are some indications of the type of policies that are currently being implemented. Words in quotations indicate processor-specific terminology.
The Alpha 21264 microprocessor has a 32-entry speculative store buffer where a store remains until it is "retired". A store must first enter the speculative store buffer before its data is sent to the level-one cache. Stores forward their data to loads when they are in the speculative store buffer [13].
The Sun UltraSPARC-IIi processor contains a load/store unit (LSU). The LSU is responsible for calculating load and store virtual addresses as well as "decoupling" loads and stores from the pipeline by using both a load buffer and a store buffer. The pipelines are not fully decoupled so that the UltraSPARC-IIi can support precise traps. Stores in the store buffer normally have a lower priority than loads in the load buffer, but the CPU will eventually raise the priority when a "lock-out condition" is reached. There is no mention of the ordering of the loads and stores or of the possibility of load forwarding. Finally, the LSU allows stores to be combined if they have been marked with a "write-gathering attribute," but this is not done automatically (as it would be in a write buffer) [20].
The Pentium III processor is said to have twelve store buffers, where each store buffer can temporarily hold a store to memory. This is essentially one twelve-entry store buffer.
One positive effect of this policy is an increase in the average occupancy of the store buffer and therefore an increase in load forwarding. One problem with having such a "lazy" removal policy is the potential of filling the store buffer. We also need to indicate with tags which stores have retired but are still active in the store buffer. Another problem with building up active entries is the possible increase in disambiguation time.
Store Priority Switching
Cache contention arises when there are multiple available memory instructions and a finite number of ports or gateways into the first level of memory. Typically among loads, the oldest load that is ready to access memory (i.e., its address is sufficiently calculated) has the highest priority. Among stores, only the oldest store is permitted to access the memory, given that it is non-speculative, has its data, and the effective address is sufficiently calculated.
The policy for selecting among the stores and among the loads is clear, but what happens if both a load and a store are ready to access memory in some given clock cycle? One, the load is given a higher priority, or loads first. Two, the store is given priority, or stores first. Three, the oldest instruction is given priority, or oldest first. There is also the option to change the priority scheme dynamically, as implemented in the UltraSPARC-IIi. For instance, the default priority could be loads first. Once the store buffer occupancy level is equal to or above some threshold level, the policy then switches to stores first. The desired effect is to reduce the number of store buffer stalls and therefore improve the performance of the processor.
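The arbitration choices described above can be illustrated with a minimal Python sketch. It assumes a single cache port and at most one ready load and one ready store per cycle; the names `MemOp`, `pick`, `occupancy`, and `threshold` are hypothetical and are not taken from the simulator used in this study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    kind: str  # "load" or "store"
    age: int   # program-order sequence number; smaller means older


def pick(load: Optional[MemOp], store: Optional[MemOp], policy: str,
         occupancy: int = 0, threshold: int = 24) -> Optional[MemOp]:
    """Choose which ready operation gets the single cache port this cycle."""
    if load is None or store is None:
        return load or store  # no contention: take whichever is ready
    if policy == "dynamic":
        # Assumed switching rule: stores win once the store buffer
        # occupancy reaches the threshold, otherwise loads win.
        policy = "stores_first" if occupancy >= threshold else "loads_first"
    if policy == "loads_first":
        return load
    if policy == "stores_first":
        return store
    if policy == "oldest_first":
        return load if load.age < store.age else store
    raise ValueError(f"unknown policy: {policy}")
```

With a ready load of age 10 and a ready store of age 5, `pick` returns the load under `"loads_first"`, the store under `"stores_first"` and `"oldest_first"`, and under `"dynamic"` it returns the store only when `occupancy` is at least `threshold`.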
Virtual Store Buffer
The data in store buffers may be accessed by loads before or after the address translation stage of the store. A virtual store buffer is accessed before translation using a virtual address. Therefore, no address translation is required to access a virtual store buffer. A load can receive data from the store buffer without having its address translated, saving the address translation cycles. In the case of a physical store buffer, both the load and the store must have their addresses translated before a load can properly access the data available in the store buffer.
Aliasing, or the synonym problem, is an important issue when using virtual addresses to tag the data in a store buffer. Although this problem is infrequent, it does need to be dealt with. 
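As a rough illustration of virtual tagging and of the synonym problem, the following Python sketch tags each store with a (process ID, virtual address) pair and forwards on an exact tag match. The class `VirtualStoreBuffer`, the `page_table` dictionary, and the addresses are hypothetical; access sizes, alignment, and store removal are ignored.

```python
from typing import List, Optional, Tuple

Tag = Tuple[int, int]  # (process ID, virtual address)


class VirtualStoreBuffer:
    """Toy store buffer searched by virtual tag, with no translation."""

    def __init__(self) -> None:
        self.entries: List[Tuple[Tag, int]] = []  # oldest first

    def store(self, pid: int, vaddr: int, value: int) -> None:
        self.entries.append(((pid, vaddr), value))

    def load(self, pid: int, vaddr: int) -> Optional[int]:
        # Search youngest to oldest so the most recent matching store wins.
        for tag, value in reversed(self.entries):
            if tag == (pid, vaddr):
                return value  # forwarded without an address translation
        return None  # no match: the load must translate and access memory


# Assumed mapping in which two virtual pages alias one physical page.
page_table = {(1, 0x1000): 0x9000, (1, 0x2000): 0x9000}

vsb = VirtualStoreBuffer()
vsb.store(1, 0x1000, 42)
hit = vsb.load(1, 0x1000)   # same virtual tag: data is forwarded
miss = vsb.load(1, 0x2000)  # synonym: same physical address, no tag match
```

In this sketch `hit` receives the stored value, while `miss` finds nothing even though both virtual addresses map to the same physical location. That missed dependence is the aliasing case a real design has to detect or rule out.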
Benchmarks
For our study, we conduct simulation experiments on Sun UltraSPARC machines. We use programs from three sets of benchmarks to evaluate the store buffer schemes. Descriptions of the benchmarks and the inputs we use are in Table 2. Table 4. We use the base model store buffer as a reference when ascertaining the impact of the store buffer policies.
Per Policy Results
In Figure 3 We analyze several programs from each suite. Due to time considerations, not every program from each benchmark may be used, each program may not be run under every configuration, and programs are terminated if and when they reach 500 million instructions.
Results and Analysis
Several experiments were performed varying the individual store buffer policies for the baseline physical store buffer. Pipeline placement is found to have the greatest impact on the overall performance of the processor. The lazy store removal policy has the next greatest impact on processor performance. Priority switching is found to have negligible impact at almost all thresholds. Finally, varying store retirement policies provides some of the same benefits as the lazy store removal policy, but with a higher penalty.
The optimal 32-entry store buffer in the design space examined is a virtual store buffer with lazy store removal.
Impact of Lazy Store Removal. This policy of store removal alone has an overall positive effect on the processor. Each benchmark analyzed has an increased IPC, ranging from a negligible performance increase in m88ksim to a 3.3% IPC improvement in richards.
The average store buffer occupancy rises to about 23 entries. Another effect of this policy is a substantial increase in the amount of load forwarding and therefore a reduction in load requests sent to memory. The amount of load traffic is reduced by an average of 12.5%, ranging anywhere from 3.3% to 28.6%.
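The link between lazy removal, occupancy, and load forwarding can be seen in a toy Python model. This is a sketch under stated assumptions, not the simulator used in this study: the names `Entry`, `StoreBuffer`, `lazy_threshold`, and `run` are hypothetical, a store is written to the cache immediately after it is issued, and written entries are purged oldest first once the occupancy reaches the threshold (a threshold of 0 models removal as soon as the store is written).

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    addr: int
    value: int
    written: bool = False  # True once the data has been sent to the cache


class StoreBuffer:
    def __init__(self, capacity: int = 32, lazy_threshold: int = 0) -> None:
        self.capacity = capacity
        self.lazy_threshold = lazy_threshold
        self.entries = deque()  # oldest entry at the left
        self.forwarded = 0
        self.memory_loads = 0

    def _purge(self) -> None:
        # Drop already-written stores while occupancy is at the threshold.
        while (self.entries and len(self.entries) >= self.lazy_threshold
               and self.entries[0].written):
            self.entries.popleft()

    def store(self, addr: int, value: int) -> bool:
        if len(self.entries) >= self.capacity:
            return False  # store buffer stall
        self.entries.append(Entry(addr, value))
        self._purge()
        return True

    def drain_one(self) -> None:
        # Write the oldest unwritten store to the cache.
        for entry in self.entries:
            if not entry.written:
                entry.written = True
                break
        self._purge()

    def load(self, addr: int):
        for entry in reversed(self.entries):  # youngest match wins
            if entry.addr == addr:
                self.forwarded += 1
                return entry.value
        self.memory_loads += 1
        return None


def run(threshold: int):
    sb = StoreBuffer(capacity=32, lazy_threshold=threshold)
    for i in range(100):
        sb.store(i, i * 10)
        sb.drain_one()
        sb.load(i - 4)  # load from a recently stored address
    return sb.forwarded, sb.memory_loads, len(sb.entries)


print(run(0))   # (0, 100, 0): every load goes to memory
print(run(24))  # (96, 4, 23): most loads are forwarded
```

The trace is artificial, so the numbers only show the mechanism: keeping written stores in the buffer raises occupancy and lets later loads to the same address be forwarded instead of accessing memory.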
Unfortunately, not all of these saved loads translate directly into a performance increase. Since a load must have the same address as a recent store to be forwarded, almost all of the loads being forwarded would have been hits in the level-one cache (which has only one-cycle latency). Expensive loads, like L1 misses, are not usually caught by the store buffer. The simulated memory system also provides two MSHRs to further hide access latencies. In addition, a dynamically scheduled processor's ability to tolerate load latency varies from load to load and program to program [24]. It is possible that the loads that are avoided due to increased load forwarding are ones that can tolerate latency. 
4.37%.
Using the priority switching model, this is reduced to 0.13%, but the IPC decreases by 0.05%. There are two probable reasons for the performance decrease. One is that the amount of load forwarding has been decreased, allowing more loads to access memory and incur a longer latency. The other reason is that the reduction in cycles with a store buffer stall is relatively small. Raising the threshold at which the store priority switches to 24 active entries also decreases performance, but by a lesser amount. The programs compress95 and idl, which also had significant stall cycles due to the store buffer, did not see any improvement in performance from this policy.
Store Retirement Point. We can see that modifying the store retirement point results in a performance increase over the base model. The IPC is increased by an average of 0.93%, including one case of an IPC decrease. Our studies show that the average occupancy of the store buffer increases from an average of less than eight in the base model to an average of about 17. Therefore, increasing the store retirement threshold improves the potential of load forwarding, but also increases store buffer stalls. We find that the store buffer stalls degrade the performance gain from the extra load forwarding.
There is an effect similar to that of a lazy store removal policy: a large increase in load forwarding.
The difference is that the percentage of cycles with a store buffer stall now increases more substantially. A store cannot be considered for removal until it has been retired. In the lazy removal case, stores are retired early and are fully prepared to be purged from the store buffer when the threshold is reached. In the late retirement scenario, stores are less likely to be prepared for removal as the store buffer fills with active entries. These results show that allowing stores to contend for the memory interface resources as soon as possible does not hurt performance. If there are several outstanding stores, they can block the cache from performing important loads, but this does not appear to be a critical issue for performance. It is more important to utilize available L1 bus cycles.
Impact of Virtual Store Buffer. Implementing a virtual store buffer produces the best performance increase of any single policy studied. The IPC for the benchmarks increased by an average of 4.2%, ranging from a 0.13% increase to an 18.77% increase in richards.
Reducing the address translation step is the primary reason for the performance gain. The number of translations is reduced by an average of 5.82%, ranging from 0.96% to 10.89%, since it is assumed that a process ID and virtual address are sufficient to properly determine address dependencies.
For each load that is forwarded, the address translation latency is not incurred (our model allows one cycle for TLB hits, which occur 99% of the time). Load forwarding increases slightly with a virtual store buffer versus the base model, because the load can access the store buffer earlier.
Adding Lazy Store Removal to a Virtual Store Buffer. Having studied several combinations of the discussed policies, we find that the best performance for a processor with a 32-entry store buffer is achieved by making it virtual and implementing the lazy store removal policy with a threshold of 24. Store retirement should remain at the point indicated in the base model, once all previous instructions are completed and the store is non-speculative. Switching store priority does not need to occur for performance reasons, although in the UltraSPARC-IIi they felt it was important to avoid "lock-out conditions."
Lazy store removal increases the number of forwarded loads, and virtual accessing reduces the number of address translations that can be saved. If the best case (33.6% increase for richards) and the worst case (0.54% increase for m88ksim) are ignored, we find that the performance increase available from combining a virtual store buffer with lazy removal ranges from 0.79% to 9.57% with an average of 5.11%. The decrease in L1 load traffic and address translations is substantial. The number of loads that access memory is reduced by an average of 12.97% while the number of address translations is reduced by 12.63%. The address translations include translations for store instructions, so they do not reduce at exactly the same rate as the load traffic.
Load traffic reduction is a direct result of increased load forwarding. By implementing lazy store removal, the average number of active entries in the store buffer increases, improving the chances for load forwarding. Making the store buffer virtual accounts for the address translation reduction. For each forwarded load, there need not be an address translation to complete the load.
Policies Versus Size
This section investigates the effects of store buffer size with and without the improvement from the optimal policies. Figure 5 shows that the enhanced four-entry and eight-entry store buffers are not able to outperform larger store buffers on average. The four-entry store buffer shows a significant performance drop versus the base model. This is the result of the buffer being too small. A store buffer stall occurs in about one-third of all cycles in this case. The average store buffer occupancy per cycle is almost three entries, which explains the ineffectiveness of a lazy store removal threshold of two (S4.LR2). Making the four-entry store buffer virtual (S4.LR2.V) is a big improvement, especially for the C programs, but does not really approach the performance of a simple 8-entry buffer (S8).
The optimal eight-entry buffer (S8.LR6.V), on the other hand, does approach the performance of a naive 16-entry store buffer (S16), despite the fact that its simple configuration (S8) is significantly worse than that of the 16-entry store buffer (S16). Six benchmarks perform better on the optimal eight-entry buffer than the naive 16-entry buffer and four of those (go, ijpeg, deltablue, richards) perform better than the base model.
Figure 6 begins a closer look at the 16-entry configurations. The IPC numbers indicate that it is the virtual aspect of the 16-entry store buffer that increases performance more than lazy store removal. In Figure 7, it is apparent that almost all of the improvement in load traffic is the result of the lazy store removal. So, the extra performance provided by virtual store buffers is strictly the result of a lower latency for acquiring load data. The details of the best configuration are presented in Table 5.
In the 
Conclusions
Due to a lack of literature on the details of the store buffer, we took this opportunity to delve into the issues involved in designing a store buffer for a dynamically scheduled, out-of-order processor. These store buffer issues include size, store removal policy, store retirement point, store priority switching, and virtual store buffers.
We find that incorporating a lazy store removal policy alone substantially increases the amount of load forwarding that takes place, yet does not greatly increase the number of store buffer stalls in a 32-entry store buffer. This increase in load forwarding reduces the number of loads that access memory by 12%. This leads to a performance improvement (in IPC) ranging from 0.15% to 6.9%. A 16-entry store buffer with this policy can approach and in some cases surpass the performance of a 32-entry store buffer. This policy has less effect on store buffers of four and eight entries.
Switching from the base model to a virtual store buffer model improves performance by reducing the number of address translations that take place before useful memory access work can be performed. Forwarded loads now avoid the address translation latency. The IPC increases by an average of 4.1% in this case.
By both incorporating lazy store removal and making the store buffer virtual, we find that the IPC of the processor can increase by an average of 5.11% over all benchmarks for a store buffer of size 32 and by as much as 33% in specific cases. On average a 16-entry store buffer with these policies outperforms a normal 32-entry store buffer. Even an eight-entry store buffer outperforms a 32-entry store buffer for certain benchmarks. Four- and eight-entry store buffers with this implementation, on average, approach but do not exceed the next larger size studied.
There are, of course, many combinations of policies, configurations, and parameters that we did not explore due to time considerations. It is possible that some other combination of store removal, store priorities, and a store retirement threshold could create slightly better performance. What this paper should convey to the reader is that there are many store buffer design decisions to make and the subsequent impact on performance is not trivial.
