Hardware prefetching is a latency-tolerance technique that hides the cost of DRAM accesses. Although hardware prefetching is a fundamental mechanism present on most commercial machines, no single prefetching technique works well across all access patterns and workload types. In this paper, we propose Arsenal, a prefetching framework that combines the advantages of different data prefetchers by dynamically selecting the best-suited prefetcher for the current workload, thereby improving the versatility of the prefetching system. Arsenal builds on the classic Sandbox prefetcher, which dynamically adapts and utilizes multiple offsets for sequential prefetchers. We take this idea a step further by switching at run time between prefetchers such as the Multi-Lookahead Offset Prefetcher and the Timing SKID prefetcher. Arsenal uses space-efficient Bloom filters to keep track of the useful prefetches of each component prefetcher and thus to maintain a score for each of them. This approach is shown to provide better speedup than any one prefetcher alone. Arsenal provides a performance improvement of 44.29% on the single-core mixes and 19.5% for some of the selected 25 representative multi-core mixes.
INTRODUCTION
Most modern prefetchers are designed with a particular scenario in mind and thus perform well only when the cache access pattern matches that scenario. In this work we present Arsenal, a data prefetching framework that dynamically selects the best-suited prefetcher among its components for the current workload and deploys it, with the aim of obtaining the highest available speedup irrespective of the type of cache access pattern. As proof of concept, we present two cases, one at the L1D cache level and another at the L2D cache level.
Literature Survey and Motivation
To understand the effectiveness of state-of-the-art prefetchers on a common framework, we analysed the performance of various L1 cache centric prefetchers, such as T-SKID [1], MLOP [2], Bingo [3], and Pangloss [4], as well as L2 cache centric prefetchers, such as SPP [5], VLDP [6], and Best-Offset [7], using the trace-based simulator ChampSim. We used traces of SPEC CPU 2017 to compare their performance. Among the L1D centric prefetchers, T-SKID is the clear winner for the single-core mixes in terms of overall performance; however, there are workloads in which T-SKID underperforms compared to other prefetchers. For example, for the GCC and fotonik3d traces, MLOP provides greater speedup than T-SKID. Detailed analysis across the benchmark, as shown in Figure 1, reveals that a much higher speedup is attainable if we pick the best-performing prefetcher for each workload. A similar observation was made for the L2D cache centric prefetchers, as summarised in Figure 2. This led to the idea of recognizing the type of workload dynamically and deploying the suitable prefetcher at run time. For such a framework to provide the maximum benefit with a minimum number of component prefetchers (and thus minimum overhead), the chosen components have to be orthogonal, i.e., they must perform well on complementary sets of workloads. The analysis leads to the conclusion that T-SKID [1] and MLOP [2] form such a pair among the L1D centric prefetchers, so these were chosen for the first test case, which targets the L1D cache. Among the L2D centric prefetchers, SPP [5] and IP-stride were chosen. To find these combinations, we explored the conventional and state-of-the-art prefetchers, including those that appeared in Data Prefetching Championship 1 [8, 9, 10, 11, 12, 13, 14, 15], Data Prefetching Championship 2 [16, 17, 18, 19, 20, 6, 5, 7], and Data Prefetching Championship 3 [21, 22, 23, 1, 4, 3, 2]. In this article, we present these two cases as proof of concept for the Arsenal framework.
IMPLEMENTATION
In this section, we provide implementation details of the Arsenal framework for both test cases. Descriptions that do not single out test case 1 or test case 2 apply to both.
Component Prefetchers
Here we introduce the different component prefetchers that we have analyzed and eventually used as a proof of concept for our Arsenal prefetching framework.
Test Case 1
Timing SKID (T-SKID) [1]: The T-SKID prefetcher exploits repetitive access patterns spread over a large instruction window, which conventional prefetchers fail to recognize because they observe only a short instruction window. Cache lines, even if predicted and prefetched successfully, may be evicted before being accessed because of intermediate thrashing. T-SKID learns these access patterns and controls the prefetch timing based on the PC, which correlates strongly with memory access patterns even in different address zones.
Multi-Lookahead Offset Prefetcher (MLOP) [2]: MLOP evaluates different prefetch offsets on two metrics, timeliness and miss coverage, because many conventional offset prefetchers either neglect timeliness or sacrifice miss coverage when selecting the optimum prefetch offset. State-of-the-art offset prefetchers generally lose miss coverage because they rely on a single best offset, namely the one that generates the most timely prefetch requests. Instead of such a binary classification, MLOP considers multiple lookaheads for every prefetch offset and scores them individually. It then selects one offset for each lookahead level, which allows the prefetcher to issue enough requests while still accounting for the timeliness of these requests.
Test Case 2
IP-stride: IP-stride is a stride prefetcher that detects stride patterns on a per-instruction-pointer (IP) basis. It maintains a table of the previous addresses accessed by a list of instruction pointers. When the same instruction is executed again, the stride between the new address and the stored address is calculated, and a prefetch request is made based on it. Stored IPs are replaced using the LRU algorithm.
Signature Path Prefetcher (SPP) [5]: SPP stores stride patterns in a compressed form in the signature table (ST). Each entry in the ST is used to index into the pattern table (PT), which is used to predict the next stride and also contains the confidence for the current prefetch. The signature is then updated with the latest stride and is used to look up the PT recursively to predict more strides. This continues until the confidence, which is multiplied by the confidence of the last prefetch at each step, falls below a certain threshold. The global history register (GHR) stores prefetch requests that cross page boundaries so that prefetching can take place across pages.
Next line: Next line is one of the simplest prefetchers; it prefetches the next cache line on each cache miss or prefetch hit. Here we use a modified version that varies its aggressiveness, i.e., the number of cache lines prefetched, based on its score.
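As an illustration of the IP-stride mechanism described above, the following Python sketch models the per-IP table with LRU replacement. It is a minimal sketch and not the implementation evaluated in this work: the class name `IPStridePrefetcher`, the parameters `table_size` and `degree`, and the rule that a stride must repeat once before a prefetch is issued are all assumptions made for the example.

```python
from collections import OrderedDict


class IPStridePrefetcher:
    """Toy IP-stride prefetcher operating on cache-line addresses."""

    def __init__(self, table_size=64, degree=1):
        self.table_size = table_size  # hypothetical number of tracked IPs
        self.degree = degree          # hypothetical lines prefetched per trigger
        # ip -> (last_line, last_stride), ordered from least to most recently used
        self.table = OrderedDict()

    def access(self, ip, line_addr):
        """Record one access and return the line addresses to prefetch."""
        prefetches = []
        if ip in self.table:
            last_line, last_stride = self.table[ip]
            stride = line_addr - last_line
            # Assumption: prefetch only when the same non-zero stride repeats.
            if stride != 0 and stride == last_stride:
                prefetches = [line_addr + stride * k
                              for k in range(1, self.degree + 1)]
            self.table[ip] = (line_addr, stride)
            self.table.move_to_end(ip)  # mark this IP as most recently used
        else:
            if len(self.table) >= self.table_size:
                self.table.popitem(last=False)  # evict the LRU entry
            self.table[ip] = (line_addr, 0)
        return prefetches
```

For instance, an IP that touches lines 100, 102, and 104 produces no prefetch on the first two accesses and, with `degree=2`, requests lines 106 and 108 on the third.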
The Arsenal Framework

Gathering performance parameters
Arsenal is motivated by the basic Sandbox prefetcher [24], which searches for and selects the best offset for an application. With Arsenal, we instead try to select the best prefetcher among the available components. The Arsenal framework is trained on prefetch activation events (PAEs), i.e., cache misses and cache prefetch hits. The framework works in two phases: (i) a continuous evaluation phase and (ii) a selection phase. The selection phase is triggered when the evaluation count (the number of prefetcher calls) of every component prefetcher crosses a threshold, EVAL-CNT, whose value was chosen after careful examination. At the end of every selection phase, the best-suited prefetcher is selected using the parameters gathered during the evaluation phase (in some cases, none may be selected).
At each PAE, the Arsenal framework triggers all the prefetchers. The cache lines requested by each prefetcher are stored in its own Bloom filter [25] without being passed along to the prefetch queue. Only the prefetch requests of the prefetcher selected during the last selection phase are passed to the prefetch queue, i.e., actually prefetched. In addition, the evaluation counter of each prefetcher is incremented by one, and its prefetch count is incremented by the number of prefetch requests it generated. At every miss and prefetch hit, the cache line address corresponding to the demanded address is compared with the contents of each of the Bloom filters. If a prefetcher's Bloom filter produces a match, that prefetcher's score is incremented by SCORE-INC; otherwise, its score is decremented by SCORE-DEC. The switch from the evaluation phase to the selection phase happens when all of the evaluation counters exceed EVAL-CNT.
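The evaluation phase described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the evaluated hardware design: the names `BloomFilter` and `ArsenalSandbox`, the filter size and hash construction, and the numeric values given to SCORE-INC, SCORE-DEC, and EVAL-CNT are hypothetical. The sketch also assumes that the demanded line is checked against the filters before the component prefetchers are triggered for that event, an ordering the text does not fix.

```python
import hashlib

SCORE_INC = 1   # assumed value for SCORE-INC
SCORE_DEC = 1   # assumed value for SCORE-DEC
EVAL_CNT = 256  # assumed value for EVAL-CNT


class BloomFilter:
    """Small Bloom filter over cache-line addresses (sizes are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, line_addr):
        for i in range(self.num_hashes):
            data = f"{i}:{line_addr}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, line_addr):
        for pos in self._positions(line_addr):
            self.bits[pos] = 1

    def __contains__(self, line_addr):
        return all(self.bits[pos] for pos in self._positions(line_addr))


class ArsenalSandbox:
    """Scores component prefetchers; each is a callable (ip, line) -> lines."""

    def __init__(self, prefetchers):
        self.prefetchers = prefetchers
        self.filters = {name: BloomFilter() for name in prefetchers}
        self.score = dict.fromkeys(prefetchers, 0)
        self.eval_count = dict.fromkeys(prefetchers, 0)
        self.prefetch_count = dict.fromkeys(prefetchers, 0)
        self.selected = None  # set by the selection phase

    def on_pae(self, ip, line_addr):
        """Handle one PAE and return the lines sent to the prefetch queue."""
        # Score each component: did it request this line earlier?
        for name, bloom in self.filters.items():
            if line_addr in bloom:
                self.score[name] += SCORE_INC
            else:
                self.score[name] -= SCORE_DEC
        issued = []
        # Trigger every component; only the selected one really prefetches.
        for name, predict in self.prefetchers.items():
            candidates = predict(ip, line_addr)
            for line in candidates:
                self.filters[name].add(line)
            self.eval_count[name] += 1
            self.prefetch_count[name] += len(candidates)
            if name == self.selected:
                issued = list(candidates)
        return issued

    def evaluation_done(self):
        """True once every evaluation counter exceeds EVAL_CNT."""
        return all(count > EVAL_CNT for count in self.eval_count.values())
```

Because a Bloom filter admits false positives but no false negatives, a component can occasionally be credited for a line it never requested, but a line it did request is never missed; this is the cost of the compact representation.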
Prefetcher Selection
Test Case 1
Because T-SKID and MLOP are intelligent prefetchers that adjust their own aggressiveness through a feedback mechanism, the number of prefetches they attempt is considered in addition to their scores. Specifically, when a wrong (less favorable) prefetcher is selected, its score may become inflated, leading to a faulty cycle in which the wrong prefetcher keeps getting selected. The number of prefetches attempted can be used to correct this. If the T-SKID score is higher, or if T-SKID attempts more prefetches than TSKID_SELECTION_ATTEMPT, T-SKID is selected. If the MLOP score is higher, or if MLOP attempts as many prefetches as T-SKID, MLOP is selected.
Test Case 2
If SPP or IP-stride has the maximum score and that score is greater than the MIN-SCORE threshold, that prefetcher is selected. If the next-line score is the maximum among the three and is greater than a threshold called NEXT-LINE-MIN-SCORE, then next line is selected. If the next-line score is the maximum of the three but is not higher than NEXT-LINE-MIN-SCORE, whichever of SPP and IP-stride has the higher score (provided it is also greater than MIN-SCORE) is selected. If none of the scores crosses its respective threshold, then no prefetcher is selected.
Hardware Overhead: Table 1 shows the hardware overhead of the Arsenal framework. This is the overhead of the framework alone, i.e., in addition to the memory overhead requirements of the component prefetchers. Table 2 shows the hardware overhead for the two test cases, taking into account the memory overhead requirements of the component prefetchers.
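The selection rules for test case 2 can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name `select_prefetcher`, the dictionary keys, and the default threshold values are hypothetical, and ties between equal scores are broken in favor of the earlier entry in the candidate tuple, a detail the rules above leave open.

```python
def select_prefetcher(scores, min_score=32, next_line_min_score=64):
    """Return 'spp', 'ip_stride', 'next_line', or None (no prefetcher).

    scores maps each of the three component names to its score.
    The default thresholds are placeholder values, not the tuned ones.
    """
    # Assumption: on a tie, the earlier name in this tuple wins.
    best = max(("spp", "ip_stride", "next_line"), key=scores.get)
    if best == "next_line":
        if scores["next_line"] > next_line_min_score:
            return "next_line"
        # Next line leads but misses its threshold: fall back to the
        # better of SPP and IP-stride.
        best = max(("spp", "ip_stride"), key=scores.get)
    # SPP or IP-stride must still clear MIN-SCORE to be selected.
    return best if scores[best] > min_score else None
```

With these placeholder thresholds, scores of 10 (SPP), 40 (IP-stride), and 60 (next line) select IP-stride: next line has the maximum score but does not exceed its own threshold, while IP-stride exceeds MIN-SCORE.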
EVALUATION AND RESULTS
We used traces of SPEC CPU 2017 to evaluate the performance of Arsenal framework. Figure 4 shows the normal- 
CONCLUSION AND FUTURE WORK
This paper proposed the Arsenal framework, which selects the best prefetcher from three prefetchers using a sandbox method. The framework uses Bloom filters to test the effectiveness of all the prefetchers. Arsenal provides an average performance improvement of 44.29% for the single-core traces. The effectiveness of Arsenal will improve if the framework is given multiple prefetchers that complement each other, such as a combination of regular and irregular prefetchers. Exploring this, along with modeling DRAM contention for multi-cores, is an exciting avenue for future work. Further research is also required to make the selection process adaptive, so that the framework can modify its selection criteria at run time when it encounters new workloads.
ACKNOWLEDGEMENT
Thanks to Biswabandan Panda, IIT Kanpur for his valuable suggestions.
