Advances in process technology have led to a dramatic increase in processing power. However, memory latency has not improved at a comparable rate and has therefore become a major limiting factor of system performance. To mitigate this problem, memory systems are designed as hierarchies containing several levels of caches. A hardware prefetcher further reduces the effective memory latency by fetching data from main memory into the cache in advance. This paper presents a Delta Correlation Prediction Table (DCPT) based Level 2 (L2) cache prefetcher optimized for the Alpha 21264 microprocessor with a 1 MB L2 cache. The performance of the optimized algorithm has been evaluated using the SPEC CPU2000 benchmark suite on the M5 simulator. The results show that the overall speedup achieved with the implemented prefetcher is 1.091, a 6.1% performance improvement over the reference DCPT-P prefetcher.
INTRODUCTION
In the competitive world of computer architecture development, scientists and engineers strive to build the most efficient computers. One area that shows promise is prefetching. While processors have been progressing rapidly, they have been held back by memory access speeds. This phenomenon is called the "memory wall" [1]. It has been alleviated by the emergence of caches: memory that resides on chip and serves small chunks of data at faster rates than main memory. The resulting memory hierarchy has led to a constant trade-off between speed, size and cost. Larger memory banks are slower but more affordable, whereas fast memory is expensive and cannot hold much; as speed and size increase, so does the price. Prefetching aims to make this hierarchy more efficient by using algorithms to predict what data will be needed in the caches, thereby reducing the number of cycles that pass before the data can be processed. The principle of locality addresses two main patterns: temporal and spatial locality. Temporal locality refers to the same memory being used frequently within a short period of time, while spatial locality refers to the use of data elements that are physically stored close together. Within spatial locality there is the special case of sequential locality, where the data is arranged and accessed linearly.
In this paper, we have implemented the Delta Correlation Prediction Table with Partial-matching (DCPT-P) prefetcher and optimized it for the SPEC CPU2000 benchmark suite. This paper is organized as follows: Section 2 describes related work. Section 3 presents the optimized DCPT-P algorithm, which is followed by Section 4, which describes the methodology used to optimize the performance. Section 5 presents and discusses our results. Finally, the paper is concluded in Section 6.
RELATED WORK
Many different algorithms have been proposed, used and combined to make more efficient prefetching heuristics. The sequential prefetcher is commonly regarded as the simplest [2]: it merely fetches the next block if a miss occurs in the cache. A slight modification brought about the tagged sequential prefetcher, which adds an extra bit to each cache line that is set when a block is prefetched into the cache. The next cache line is fetched when there is a cache hit on a block whose bit is set. One drawback of the sequential prefetcher is that it fetches unnecessary data if the process does not access memory in a contiguous fashion.
This problem is addressed by the Reference Prediction Table (RPT) [4]. In PC/DC (program counter/delta correlation) prefetching, every time an L2 cache miss occurs, the deltas between consecutive misses are calculated and stored in a table. Once the delta history has been computed, it is searched for the most recent pair of deltas. If a corresponding pair is found in the delta history, future deltas are predicted from it. DCPT (Delta Correlating Prediction Tables) combines the tables of RPT with the delta correlating design of PC/DC [5]. One way in which it improves upon PC/DC is by avoiding pointer chasing in the Global History Buffer (GHB). It uses a delta table to predict the next memory block that will be accessed.
Access Map Pattern Matching (AMPM) is a technique that addresses spatial locality [6, 7]. It divides memory into fixed-size sections to find which data is used most. It uses 2 bits to represent four states: Init, Prefetch, Access, and Success. These states are stored for the cache lines of hot zones in memory access pattern maps, which do not record any access order. The number of hot zones is fixed, and zones are managed according to the LRU (Least Recently Used) policy. The AMPM prefetcher uses this data to calculate the stride address correlation by pattern matching on the memory access pattern map and to determine the prefetch candidates. It also calculates which prefetch requests are necessary and how many of them to issue.
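As an illustration of the simplest scheme above, tagged sequential prefetching can be sketched in Python as follows. This is a toy model under stated assumptions, not code from any of the cited works: the class name, the dictionary used as a cache, the absence of eviction, and the clearing of the tag bit on the first hit are all choices made for this sketch.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for this sketch)


class TaggedSequentialPrefetcher:
    """Toy model: the cache is a dict from block address to its tag bit."""

    def __init__(self):
        # block address -> True if the block was brought in by a prefetch
        self.cache = {}

    def _prefetch(self, block):
        """Bring a block in via prefetch and set its tag bit."""
        if block not in self.cache:
            self.cache[block] = True
            return [block]
        return []

    def access(self, addr):
        """Process one demand access; return the list of blocks prefetched."""
        block = addr - (addr % BLOCK_SIZE)
        next_block = block + BLOCK_SIZE
        if block not in self.cache:
            # Miss: demand-fetch the block (tag clear), prefetch the next one.
            self.cache[block] = False
            return self._prefetch(next_block)
        if self.cache[block]:
            # Hit on a tagged block: clear the tag and prefetch the next one.
            self.cache[block] = False
            return self._prefetch(next_block)
        return []  # ordinary hit on an untagged block: nothing to prefetch
```

With this model, a linear walk over memory keeps triggering one prefetch per block, while repeated accesses to the same block trigger none.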
OPTIMIZED DCPT PREFETCHER WITH PARTIAL MATCHING (DCPT-OPT)
The DCPT-P prefetcher [8] is an improvement on the DCPT prefetcher that takes into account more complex and irregular patterns. In contrast to DCPT, which fails to detect a pattern when the access pattern is complex, DCPT-P uses the concept of "Partial Matching" to fetch a block. The main idea of partial matching is to prefetch the most commonly accessed block or a previously accessed block if no regular pattern is found in the sequence of deltas for the instruction under execution.
DCPT Table Structure
The DCPT table is the core of the DCPT-P prefetcher. In our implementation, each entry of the DCPT table consists of the following fields:
- PC: contains the value of the Program Counter (PC).
- Last Address: contains the last address accessed by this PC.
- Delta: a number of calculated deltas (differences between two successive cache miss addresses), maintained in a circular buffer.
- Delta index: points to the next available location in the circular buffer where a delta will be inserted.
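The entry layout above can be sketched in Python as follows. This is a minimal sketch rather than the implementation evaluated in this paper: the class name DCPTEntry, the record and history helpers, and the count field (used only to know how many slots are valid) are hypothetical, and the default capacity of 9 is the value selected later in the results section.

```python
from dataclasses import dataclass, field


@dataclass
class DCPTEntry:
    pc: int                    # program counter that owns this entry
    last_address: int          # last address seen for this PC
    capacity: int = 9          # deltas kept per entry
    deltas: list = field(default_factory=list)  # circular delta buffer
    delta_index: int = 0       # next slot to be written in the buffer
    count: int = 0             # valid deltas so far (sketch-only bookkeeping)

    def __post_init__(self):
        if not self.deltas:
            self.deltas = [0] * self.capacity

    def record(self, addr):
        """Store the delta to the previous address and advance the index."""
        self.deltas[self.delta_index] = addr - self.last_address
        self.delta_index = (self.delta_index + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)
        self.last_address = addr

    def history(self):
        """Return the valid deltas ordered from oldest to newest."""
        start = (self.delta_index - self.count) % self.capacity
        return [self.deltas[(start + i) % self.capacity]
                for i in range(self.count)]
```

For example, an entry with capacity 3 that records the addresses 110, 130, 160 and 200 after a last address of 100 ends up holding the deltas 20, 30 and 40, the oldest delta (10) having been overwritten.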
Our DCPT table differs slightly from that of [5], since the last prefetch field of the table entry is left out in our implementation. The purpose of the last prefetch field is to keep track of the last prefetched address in order to avoid fetching the same block multiple times while the block is in the cache. The absence of the last prefetch field in our table is compensated for by the use of the in_cache function [9] of the given prefetch interface. Each time before fetching a block from memory, we call the in_cache function to ensure that the memory block has not already been fetched and is not already available in the cache.
Prefetching Algorithm (DCPT-OPT)
The main flow of our implemented prefetcher follows the DCPT algorithm [8]. Initially, on each cache miss for a memory access, the corresponding entry of the DCPT table is simply updated as long as it holds fewer than three deltas. Once the entry holds three or more deltas, the algorithm searches the delta queue for a pattern matching the two most recently inserted deltas. If it finds such a pattern in the queue, it generates candidate addresses from the matching deltas. The first candidate is generated by adding the first matched delta to the last address field, and the second candidate is generated by adding the second delta to the previously generated candidate. In the next step, the generated candidate addresses are prefetched if they are neither present in the cache nor waiting in the MSHR queue. In order to prevent the prefetch algorithm from being too aggressive, we have modified the delta calculation algorithm presented in [5]: we limit the length of a matching pattern to three and avoid prefetching more than three blocks at a time for the same PC value.
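The search and candidate generation steps can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the evaluated implementation: it follows the common DCPT formulation in which the deltas that followed the earlier occurrence of the pair are replayed on top of the last address, and the function names and the in_cache and in_mshr_queue callables are hypothetical stand-ins for the checks provided by the prefetch interface.

```python
def delta_correlation(last_address, deltas, max_candidates=3):
    """Return candidate addresses; deltas are ordered oldest to newest."""
    if len(deltas) < 3:
        return []  # not enough history yet: the entry is only updated
    d1, d2 = deltas[-2], deltas[-1]  # the two most recent deltas
    # Walk backwards looking for an earlier occurrence of the pair (d1, d2).
    for i in range(len(deltas) - 2, 0, -1):
        if deltas[i - 1] == d1 and deltas[i] == d2:
            candidates = []
            addr = last_address
            # Replay the deltas that followed the earlier occurrence.
            for d in deltas[i + 1:]:
                addr += d
                candidates.append(addr)
                if len(candidates) == max_candidates:
                    break  # cap the number of blocks prefetched per PC
            return candidates
    return []  # no exact match found


def filter_candidates(candidates, in_cache, in_mshr_queue):
    """Drop candidates that are already cached or already in flight."""
    return [a for a in candidates
            if not in_cache(a) and not in_mshr_queue(a)]
```

For a last address of 1000 and the delta history 10, 20, 30, 10, 20, the pair (10, 20) is found at the start of the history, so the deltas 30, 10 and 20 are replayed and the candidates are 1030, 1040 and 1060. If 1040 is already in the cache, only 1030 and 1060 remain after filtering.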
If no matching pattern is found in the queue, the algorithm looks for a partial match. To support partial matching, we mask out the 5 least significant bits of the delta values. Since the L2 cache block size is 64 bytes in our framework, by masking out the 5 lower bits we are able to generate the candidate values using DCPT-P [8].
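The effect of masking on the pattern search can be sketched in Python as follows. This is a sketch under assumptions rather than the evaluated code: the function names are hypothetical, only the search step is shown (how candidates are derived from a partial match is left out), and passing bits=0 simply disables masking so that the same routine also performs the exact search.

```python
MASK_BITS = 5  # number of least significant bits ignored in a partial match


def mask_delta(delta, bits=MASK_BITS):
    """Clear the lowest `bits` bits of a delta."""
    return delta & ~((1 << bits) - 1)


def find_match(deltas, bits=MASK_BITS):
    """Return the index where an earlier copy of the newest pair ends.

    Deltas are ordered oldest to newest and are compared after masking.
    Returns None when no earlier occurrence of the pair exists.
    """
    if len(deltas) < 3:
        return None
    masked = [mask_delta(d, bits) for d in deltas]
    d1, d2 = masked[-2], masked[-1]
    for i in range(len(masked) - 2, 0, -1):
        if masked[i - 1] == d1 and masked[i] == d2:
            return i
    return None
```

For the delta history 66, 130, 300, 70, 140, an exact search fails because 66 differs from 70 and 130 differs from 140. After masking, the history becomes 64, 128, 288, 64, 128, so the most recent pair matches the first two deltas.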
METHODOLOGY
In order to evaluate the performance of the implemented prefetcher, we have used the SPEC CPU2000 benchmark suite [10] with the M5 simulator [11]. The simulated architecture is an Out of Order (OoO) CPU based on the Alpha 21264 microprocessor, with a 32 kB L1 cache (without prefetching) and a 1 MB L2 cache [9]. The simulated CPU clock runs at 2 GHz. The memory bus runs at 400 MHz, is 64 bits wide, and has a latency of 30 ns. The size of the physical memory is 256 MB. The L2 cache prefetcher is notified every time there is a hit or a miss in the L2 cache. The size of a cache block is 64 bytes and the maximum number of pending prefetch requests is 100 (MAX_QUEUE_SIZE). We have evaluated our prefetcher against several reference prefetchers, including adaptive sequential, sequential on access, sequential on miss, RPT, DCPT and DCPT-P. Initially, we tuned some of the parameters of the implemented DCPT-OPT prefetcher in order to reach peak performance. Afterwards, we compared the results of our prefetcher with those of the reference prefetchers.
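To put these parameters in perspective, the following back-of-the-envelope calculation converts the bus figures into CPU cycles. It is only a rough estimate under assumptions that are not stated in the setup: one 64-bit transfer per bus cycle, and no queuing or DRAM effects. The variable names are illustrative.

```python
CPU_MHZ = 2000        # simulated CPU clock (2 GHz)
BUS_MHZ = 400         # memory bus clock
BUS_WIDTH_BITS = 64   # memory bus width
BUS_LATENCY_NS = 30   # memory bus latency
BLOCK_BYTES = 64      # cache block size

# CPU cycles that elapse during one bus cycle.
cpu_cycles_per_bus_cycle = CPU_MHZ // BUS_MHZ

# Bus latency expressed in CPU cycles (ns * MHz / 1000 = cycles).
latency_cpu_cycles = BUS_LATENCY_NS * CPU_MHZ // 1000

# Bus transfers needed to move one cache block (assumes 8 bytes per transfer).
transfers_per_block = BLOCK_BYTES // (BUS_WIDTH_BITS // 8)

# CPU cycles spent transferring one block (assumes one transfer per bus cycle).
transfer_cpu_cycles = transfers_per_block * cpu_cycles_per_bus_cycle

print(latency_cpu_cycles, transfers_per_block, transfer_cpu_cycles)
```

Under these assumptions the bus latency alone corresponds to 60 CPU cycles, and moving one 64-byte block takes 8 transfers, or a further 40 CPU cycles, which is the cost a timely prefetch aims to hide.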
RESULTS AND DISCUSSION
The performance of the implemented prefetcher is largely influenced by its parameters, such as the number of entries in the table and the number of deltas in the delta queue. Therefore, we first tuned these parameters in order to get the best performance from our prefetcher. Initially, we varied the number of entries in the DCPT table. Figure 1 presents the results we obtained from this experiment. It shows that the speedup increases with the number of entries in the DCPT table. However, once the number of entries exceeds 90, the speedup stagnates. Therefore, we have fixed the number of entries in the DCPT table at 90. In our next experiment, keeping the number of DCPT entries fixed at 90, we varied the number of deltas per entry for further optimization. The result is presented in Figure 2, from which it is evident that the performance of the prefetcher increases up to 9 deltas per entry. Beyond that, the speedup remains the same irrespective of the number of deltas. Therefore, the best performance is obtained from our prefetcher by using a DCPT table with 90 entries and 9 deltas per entry.
With this setup, we have compared the speedup achieved by our prefetcher with that of the reference prefetchers. The results are presented in Figure 3 and Figure 4. Figure 3 shows that the implemented prefetcher outperforms all of the reference prefetchers. The speedup of the reference DCPT-P prefetcher is 1.08, which is the best among all of the reference prefetchers, whereas the speedup achieved by our prefetcher is 1.091, which is 6% faster than the reference DCPT-P prefetcher. Figure 4 presents a per-benchmark comparison of the speedup obtained with our prefetcher and with the reference prefetchers. In the figure, we can see that our prefetcher achieves a speedup greater than 1 for all the benchmark programs except twolf. For twolf, our prefetcher also performs worse than the other prefetchers. Nevertheless, for most of the programs, it performs better than the other reference prefetchers. The best speedup is achieved for the ammp benchmark program.
CONCLUSION
In this paper, a DCPT-P based prefetcher optimized for an OoO CPU based on the Alpha 21264 microprocessor with a 1 MB L2 cache is presented. In order to achieve optimal performance, the DCPT table structure has been modified and the DCPT parameters have been tuned for the test platform. Its performance has then been evaluated on the M5 simulator. The results reveal that the implemented prefetcher algorithm provides better performance than the other reference prefetchers. Although the overall speedup achieved is 1.091, for two benchmark programs the achieved speedup is below 1. This problem will be addressed in our future work in order to obtain further performance improvement from the implemented prefetcher.
