In this paper, we propose the design of an on-chip hardware support function whose goal is to reduce the memory latency due to data cache misses. We will show how it can reduce the contribution of the on-chip data cache to the average number of clock cycles per instruction (CPI) [3].
The component of the CPI due to cache misses depends on two factors: miss ratio and memory latency. Its importance as a contributor to the overall CPI has been illustrated in recent papers [1, 5] where it is shown that the CPI contribution of first-level data caches can reach 2.5. Figure 3a).
In all three cases, there will be cache misses. A preload request will be generated for address 90,800 (instruction 504) and the preload for address 10,000 at address 512 will be squashed since the block is still in the cache.
During the third iteration, when the PC hits instructions 500 and 504, the changes shown in Figure 3c will occur in the RPT.
Performance Evaluation
Since our main interest is in the influence of preloading on data cache access, we assume:
(1) no I-cache miss (reasonable since preloading is active only during loops; the loop code should be resident in the I-cache), (2) all operations take one cycle (i.e., perfect RISC pipelining), (3) no wait on a cache hit, (4) the processor stalls on a cache miss until the data is in the cache (i.e., no nonblocking loads).
Metrics
We present the results of our experiments by using the contribution to the CPI as the main metric. The contribution to the CPI due to data access penalty is:
$$\mathrm{CPI}_{\text{data access}} = \frac{\text{total data access time}}{\text{number of instructions executed}}$$
In the figures we show the percentage of data access penalty reduced by the preloading scheme, i.e.:
$$\%\ \text{of penalty reduced} = \frac{\mathrm{CPI}_{\text{cache}} - \mathrm{CPI}_{\text{preload}}}{\mathrm{CPI}_{\text{cache}}} \times 100$$
where $\mathrm{CPI}_{\text{cache}}$ corresponds to the pure cache experiment and $\mathrm{CPI}_{\text{preload}}$ to the "add-cost" model. We compare the preloading scheme with the equivalent pure data cache design while varying the cache size. To be concise we often contrast only two extreme cases of the benchmarks.
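As an illustration of how these two metrics combine, the following minimal Python sketch computes the CPI contribution of data accesses and the percentage of penalty reduced. The function names and the cycle counts in the usage example are hypothetical and chosen only for illustration; they are not measurements from the experiments.

```python
def cpi_data_access(total_data_access_time: float, instructions_executed: int) -> float:
    """CPI contribution due to data access penalty.

    total_data_access_time: stall cycles spent waiting on data accesses.
    instructions_executed: number of instructions executed.
    """
    if instructions_executed <= 0:
        raise ValueError("instructions_executed must be positive")
    return total_data_access_time / instructions_executed


def percent_penalty_reduced(cpi_cache: float, cpi_preload: float) -> float:
    """Percentage of data access penalty removed by preloading.

    cpi_cache: CPI contribution of the pure cache experiment.
    cpi_preload: CPI contribution with the preloading scheme.
    """
    if cpi_cache <= 0:
        raise ValueError("cpi_cache must be positive")
    return (cpi_cache - cpi_preload) / cpi_cache * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    instructions = 1_000_000
    cpi_cache = cpi_data_access(500_000, instructions)
    cpi_preload = cpi_data_access(200_000, instructions)
    print(f"CPI(cache)   = {cpi_cache:.3f}")
    print(f"CPI(preload) = {cpi_preload:.3f}")
    print(f"penalty reduced = {percent_penalty_reduced(cpi_cache, cpi_preload):.1f}%")
```

A value of 0% means preloading removed none of the data access penalty, while 100% would mean the penalty was eliminated entirely.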
5.2 Preloading performance for the Nonoverlapped model
Figure 5 presents the performance of the "no-cost" and "add-cost" organizations with varying cache sizes for the Nonoverlapped memory model. The "add-cost" organization always performs better than the pure cache scheme since it has the same amount of cache and, in addition, the preloading component. The "add-cost" organization will always perform better than the "no-cost" since it has more cache with at least the same amount of preloading hardware (the "no-cost" at cache size N Kbyte has the same performance as the "add-cost" at cache size N/2 Kbyte).
The results show that the "add-cost" preloading scheme can reduce the data access penalty by between 10% and 95% compared to a pure data cache.
The programs Compress and Pverify, not shown in Figure 5, do not benefit much from preloading. This is not surprising considering their characteristics (see Table 2). Compress relies on a data-dependent
Figure 6 and Figure 7 show, respectively, the effects of preloading for the Overlapped and Pipelined models. Since these models are less bandwidth-restrictive than the Nonoverlapped model, the preloading could take advantage of the additional degree of freedom to preload data blocks.
On the other hand, the latency is longer and any incorrect preload will result in a larger penalty.
In Figure 6, we see that both "add-cost" and "no-cost" preloading of the program Spice are slightly more advantageous than in the Nonoverlapped case.
In the case of MG3D, the effectiveness of the preloading is comparable to that in the Nonoverlapped case.
The Pipelined experiments shown in Figure 7
In general, the more contentious the memory model is, the more sensitive the preloading will be to the setting of the LA-limit d. We conjecture that it would be less interesting to set d > 6 for the Pipelined model. From the above discussion, it appears that a good setting of the LA-limit is also a characteristic of the workload.
In future work, we will examine the setting of a dynamic LA-limit based on the length of the current basic block. 
