Abstract
Introduction
It is well established that as processor speeds increase, memory becomes a serious performance bottleneck. While the introduction of caches significantly alleviated the problem, caching alone will not bridge the growing performance gap between multi-issue processors running at very high clock speeds and memory. Data prefetching has been proposed as an additional tool to bridge this gap. Existing hardware prefetching techniques require the prefetching hardware to perform some form of learning and prediction in real time. This may necessitate a significant investment in hardware, or it may impact the critical path of instruction processing; in the worst case, it can be both. In this paper, we propose a new paradigm that utilizes extensive profiling and powerful off-line learning algorithms. The main contributions of this paper are:
• A novel framework for off-line trace analysis that permits a wide range of learning algorithms;
• A prefetching microarchitecture with low hardware requirements and overhead.
Our technique showed significant improvement in prediction accuracy over existing techniques.
In Section 2, we will describe some representative previous work on this subject. In Section 3, we will discuss the use of off-line learning algorithms. Our proposed architecture will be presented in Section 4 together with the three learning algorithms that we tested. This is followed by the experimental setup, results, and a conclusion.
Previous Work
Research on memory hierarchy optimization can be classified into three broad categories: software approaches, hardware approaches, and hybrid approaches. We will briefly mention some representative work and refer the interested reader to a recently published detailed survey on the matter [25].
In the field of software prefetching, early work includes that done by Callahan, Kennedy, and Porterfield [4], and by Klaiber and Levy [16]. The former proposed the insertion of data prefetch instructions in data-intensive loops, while the latter studied efficient architectural support mechanisms for data prefetch instructions. Mowry, Lam, and Gupta [22] showed that careful analysis and selective prefetching can provide significant performance improvements in programs with regular nested loops. Other software prefetching techniques include Speculatively Prefetching Anticipated Interprocedural Dereference (SPAID) [19], the use of cache miss heuristics to drive prefetching [23], and the prefetching of recursive data structures proposed by Luk and Mowry [20].
Hardware approaches include Jouppi's "stream buffers" [12], Fu and Patel's prefetching for superscalar and vector processors [8, 9], and Chen and Baer's lookahead mechanism [6], also known as the Reference Prediction Table (RPT) [7]. Mehrota [21] proposed a hardware data prefetching scheme that attempts to recognize and use the recurrent relations that exist in the address computation of linked list traversals. Extending the idea of correlation prefetchers [5], Joseph and Grunwald [11] implemented a simple Markov model to dynamically prefetch address references. More recently, Lai, Fide, and Falsafi [18] proposed a hardware mechanism to predict the last use of cache blocks.
Hybrid approaches include the prefetch arrays proposed by Karlsson, Dahlgren, and Stenstrom [14] and VanderWiel and Lilja's data prefetch controller (DPC) [24].
Off-line Learning
Hardware predictors operate in two phases: a learning phase and a prediction phase. In the learning phase, the prediction facility is trained; typically, this involves updating a prediction table or automaton. In the prediction phase, the learned table or automaton is used to make prefetch requests. In some schemes, the prediction table or automaton may also be updated during the prediction phase, i.e. the learning and prediction phases are interleaved.
A major drawback of existing hardware schemes is the need to perform both learning and prediction at run time. This severely limits the types of learning schemes that one can use. We propose overcoming this limitation by taking the learning phase off-line. Using sample traces collected from an application, prediction tables and automata can be trained off-line. This rests on the important assumption that the sample traces used for training correctly reflect the behavior of the application during its actual run. The success of existing hardware prefetch mechanisms, all of which are based on learning past patterns to predict future references, provides strong circumstantial evidence for this.
The factors determining the success of a prefetch scheme are accuracy, timeliness, overhead, and coverage. Accuracy refers to the percentage of issued prefetch requests that are actually used. Timeliness matters because an accurately predicted prefetch request is useless if it is issued too early or too late relative to the actual use of the data. Any prefetch mechanism has an associated overhead (which may take the form of additional instructions, hardware investment, or increased bus utilization) that must not be too significant. Finally, the scheme must cover most of the load misses. Unlike on-line schemes, off-line schemes can consider a significantly larger window of the sample trace and/or use more complex analysis and learning algorithms. This generally improves the accuracy of the prediction. Furthermore, by staying focused on program hotspots, coverage is improved. The issues of timeliness and overhead will be discussed when we outline our architectural solution.
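To make the first two metrics concrete, the sketch below computes accuracy and coverage from a recorded miss trace and the set of prefetch requests a predictor issued for it. The function name and trace representation are ours for illustration, and the set-based computation deliberately ignores timeliness.

```python
def prefetch_metrics(miss_trace, prefetch_requests):
    """Compute accuracy and coverage of a prefetcher on a miss trace.

    miss_trace        -- list of cache-miss block addresses, in program order
    prefetch_requests -- list of block addresses the predictor asked to fetch
    Timeliness is ignored: a prefetch counts as useful if its address misses
    anywhere in the trace.
    """
    misses = set(miss_trace)
    issued = set(prefetch_requests)
    useful = issued & misses
    accuracy = len(useful) / len(issued) if issued else 0.0
    coverage = len(useful) / len(misses) if misses else 0.0
    return accuracy, coverage

# Toy usage: two of four prefetches are useful, two of five misses are covered.
acc, cov = prefetch_metrics([0x10, 0x20, 0x30, 0x40, 0x50],
                            [0x20, 0x40, 0x60, 0x70])
print(f"accuracy={acc:.2f} coverage={cov:.2f}")   # accuracy=0.50 coverage=0.40
```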
Markovian Predictors
In this section, we shall describe our proposed Markovian predictors. Training traces of the application of interest are collected. In our experiments, these traces are first processed through a cache simulator so that we obtain only the miss traces. It should be emphasized that, in our experiments, the evaluation used a trace generated from a different input to the application than the one used for training. During the sample trace collection phase, the application is also profiled to identify the "hotspots", i.e. sections of code in which most of the load misses occur. These training sequences are then fed to a learning/analysis algorithm that outputs a prediction model for a particular hotspot. The prediction model is essentially a table whose entries associate a miss address with a small set of predicted addresses to prefetch. In addition, from the trace we compute the frequency of occurrence of each miss address in the training trace. Next, we fix the size of the prediction table. This allows us to control the amount of hardware support needed. Since in practice not all miss addresses can be accommodated, we need a hashing function that maps a miss address to its entry (row) in the prediction table. We use a lookup mechanism that is similar to cache tag checking. We iterate through the rows of the prediction table; of all the miss addresses that map to the same row, we pick the one with the highest frequency of occurrence in the sample trace.
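A minimal sketch of this off-line table construction is given below, assuming a simple (next-miss) Markov predictor. The row count, the number of prediction slots per entry, and the modulo hash over block addresses are illustrative choices, not the paper's exact parameters.

```python
from collections import Counter, defaultdict

def build_prediction_table(miss_trace, num_rows=512, slots=4):
    """Off-line construction of a simple Markov prediction table.

    For every miss address we count which addresses immediately follow it
    and keep the 'slots' most frequent successors.  The table has a fixed
    number of rows; when several miss addresses hash to the same row, the
    one with the highest frequency of occurrence in the trace wins.
    """
    freq = Counter(miss_trace)                        # freq(a) over the trace
    successors = defaultdict(Counter)
    for cur, nxt in zip(miss_trace, miss_trace[1:]):
        successors[cur][nxt] += 1

    def row_of(addr):                                 # illustrative hash h(a)
        return (addr >> 6) % num_rows                 # drop 64-byte block offset

    table = {}                                        # row -> (tag, predictions)
    for addr, counts in successors.items():
        row = row_of(addr)
        if row not in table or freq[addr] > freq[table[row][0]]:
            table[row] = (addr, [a for a, _ in counts.most_common(slots)])
    return table
```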
Window Markov Predictor
This is a new predictor. Instead of considering only the miss address that immediately follows a given miss address, we consider every miss address within a window of size W following it.
For our experiments, we chose W to be five. Another important modification is that we do not necessarily use up all of the prefetch request slots in a table entry.
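The following sketch shows one plausible reading of the Window Markov Predictor: successors within the next W misses are pooled and ranked by frequency. The exact rule by which the predictor leaves prefetch slots unused is not spelled out above, so the frequency threshold below is purely an assumption for illustration.

```python
from collections import Counter, defaultdict

def window_markov_table(miss_trace, window=5, slots=4, min_count=2):
    """For each miss address, gather every miss occurring within the next
    'window' misses and keep the most frequent ones as predictions.

    An entry may end up with fewer than 'slots' predictions: successors seen
    fewer than 'min_count' times are dropped, which is our stand-in for the
    predictor not using up all of its prefetch request slots.
    """
    pooled = defaultdict(Counter)
    for i, cur in enumerate(miss_trace):
        for nxt in miss_trace[i + 1 : i + 1 + window]:
            pooled[cur][nxt] += 1

    predictions = {}
    for addr, counts in pooled.items():
        kept = [a for a, c in counts.most_common(slots) if c >= min_count]
        if kept:
            predictions[addr] = kept
    return predictions
```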
Hidden Markov Model (HMP) Predictor
The Hidden Markov Model (HMP) is a well-known technique that has a wide range of applications [10, 17, 23]. Essentially, it is a Markov chain in which each state generates an observation. HMPs are known to be very useful for time-series modeling since the discrete state space can be used to approximate many non-linear, non-Gaussian systems. There are established algorithms for training an HMP, such as the Viterbi and Baum-Welch algorithms [13]. We extract the prediction table from the HMP by examining its state transition and output probabilities.
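The sketch below illustrates one way such a table could be read out of an already trained model, given a transition matrix A, an emission matrix B, and the list of miss-address symbols. Training itself (e.g. with Baum-Welch) is omitted, and the extraction rule shown here is an assumption rather than the paper's exact procedure.

```python
import numpy as np

def hmm_prediction_table(A, B, symbols, slots=4):
    """Read a per-address prediction table out of a trained hidden Markov model.

    A       -- (S, S) state-transition probabilities
    B       -- (S, V) probabilities of emitting each miss-address symbol per state
    symbols -- list of V miss addresses, one per observation column
    For each miss address we pick the state most likely to have emitted it,
    then rank the next-observation distribution implied by A and B.
    """
    table = {}
    for v, addr in enumerate(symbols):
        s = int(np.argmax(B[:, v]))          # most likely emitting state
        next_obs = A[s] @ B                  # distribution over the next miss
        ranked = np.argsort(next_obs)[::-1]
        table[addr] = [symbols[j] for j in ranked[:slots] if symbols[j] != addr]
    return table
```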
Encoding the Prediction Table Entries
In order to reduce the size of the prediction tables, we used a stride-based encoding scheme. Consider an entry of the prediction table derived above, consisting of a miss address and its four predicted addresses, and let the displacements be the differences between each predicted address and the miss address. There are four cases for the encoding:
• Case 1: All four displacements are in the integer interval [-128, 127]. We store all four displacements in a 4-byte word, and the miss address is stored as a full 4-byte address.
Two additional bits are needed to distinguish which case applies to an entry. This encoding scheme sacrifices some accuracy but results in a very compact table. The actual table size per hotspot is shown in Table 3. On average, the prediction table for each hotspot is about 8Kbyte, with about 2.8 predictions per table entry.
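A small sketch of Case 1 of this encoding is shown below. The displacement unit (bytes) and the 8-byte packed layout are assumptions; only the general scheme, a full 4-byte miss address plus four signed one-byte displacements, follows from the text.

```python
import struct

def encode_entry_case1(addr, predictions):
    """Case 1: keep the miss address as a full 4-byte word and the four
    predictions as signed one-byte displacements from it (8 bytes in total)."""
    deltas = [p - addr for p in predictions]
    assert len(deltas) == 4 and all(-128 <= d <= 127 for d in deltas)
    return struct.pack("<I4b", addr & 0xFFFFFFFF, *deltas)

def decode_entry_case1(blob):
    addr, *deltas = struct.unpack("<I4b", blob)
    return addr, [addr + d for d in deltas]

entry = encode_entry_case1(0x1000, [0x1008, 0x0FF8, 0x1040, 0x1010])
print(decode_entry_case1(entry))   # (4096, [4104, 4088, 4160, 4112])
```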
The Proposed Hardware Architecture
In this section, we will describe the proposed architectures in which the off-line prediction tables can be effectively deployed. The techniques described can be used to prefetch data into the L1 data cache or the L2 data cache. We begin by assuming a canonical machine with a non-blocking L1 data cache on-chip, a small prefetch buffer, and an off-chip L2 data cache. Fig. 1 and Fig. 2 show the proposed architecture for L1 and L2 prefetching, respectively. We have described how the prediction tables are constructed off-line, and shall now describe how the scheme works at runtime. Guided by the training trace, a special "load-predictor [table-addr]" instruction is inserted at the earliest branch that, in the trace, leads to a new hotspot, as shown in Fig. 3. An important issue is whether there is sufficient time to preload the predictor table. If we assume that the table is 8Kbyte, and that the bus widths for the L1 and L2 architectures are 256 bits and 128 bits respectively, then for a 2 GHz processor using a 400MHz quad-pumped bus, we estimate that it will take about 1500 CPU cycles (inclusive of startup latencies) to load an 8Kbyte table. Table 2 shows the average distance, in clock cycles, between neighboring hotspots. It should be noted that the distance between neighboring hotspot candidates was also an important consideration in the final choice of hotspots.
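The ballpark figure above can be reproduced with the back-of-the-envelope estimate below. The effective transfer rate (400 million transfers per second, i.e. a quad-pumped 100 MHz bus clock) and the startup latency are assumptions chosen to match the quoted number, not parameters taken from the paper.

```python
def table_load_cycles(table_bytes=8 * 1024, bus_bits=256,
                      transfers_per_sec=400e6, cpu_hz=2e9, startup_cycles=200):
    """Rough estimate of the CPU cycles needed to preload one prediction table."""
    transfers = table_bytes / (bus_bits // 8)        # bus-width sized chunks
    seconds = transfers / transfers_per_sec
    return int(seconds * cpu_hz) + startup_cycles

print(table_load_cycles())               # ~1480 cycles over the 256-bit L1 bus
print(table_load_cycles(bus_bits=128))   # roughly double over the 128-bit L2 bus
```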
Once the table is loaded, the prefetch engine will examine the miss addresses reported by the cache unit. Using the standard tag checking mechanism, the prefetch engine will probe the prediction table. When there is a hit in the prediction table, the prefetch engine will decode the entry and issue the prefetch requests.
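A minimal sketch of this steady-state behaviour is given below. It assumes the table entry has already been decoded into a tag and a list of predicted addresses, and the issue_prefetch callback merely stands in for handing an address to the memory system.

```python
def on_load_miss(miss_addr, prediction_table, num_rows=512, issue_prefetch=print):
    """Steady-state behaviour of the prefetch engine once a table is loaded.

    prediction_table maps a row index to (tag, predicted_addresses); the hash
    must match the one used when the table was built off-line.
    """
    row = (miss_addr >> 6) % num_rows          # same hash as the off-line pass
    entry = prediction_table.get(row)
    if entry is None:
        return                                 # no entry in this row
    tag, predictions = entry
    if tag != miss_addr:
        return                                 # tag mismatch: no prediction
    for addr in predictions:
        issue_prefetch(hex(addr))              # hand requests to the memory system
```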
The mechanism for L2 prefetch is a variation of the L1 mechanism except that instead of requiring an additional port to L2 memory, the table is fetched by cycle-stealing from the main memory bus. 
Table 1. Training and testing inputs.
Benchmark    Training input       Testing input
130.li       input1/train.lsp     input2/*.lsp
181.mcf      input train/inp.in   input ref/inp.in
183.equake   input train/inp.in   input ref/inp.
Experimental Setup
We use the Trimaran compiler-EPIC architecture simulation infrastructure [3] to evaluate the performance of our proposed system and of each of the three off-line learning algorithms outlined above. We compared the performance of our system against that of using larger caches, and against the RPT hardware prefetch scheme of Chen and Baer [7]. For evaluation, we used 130.li from SPEC 95; 181.mcf, 183.equake, 164.gzip, and 188.ammp from the SPEC 2000 suite [2]; and bisort, mst, treeadd, tsp, and health from the well-known Olden pointer benchmark suite.
Our baseline setup is an IA64-like EPIC machine [15] with four integer, two floating point, and two memory units, a 32Kbyte L1 cache, and a 256Kbyte L2 cache. We computed stall cycles for L1 and L2 load misses with L1 cache sizes of 32K, 64K, and 128K, each with a 256K L2 cache.
Our main metric for characterizing the performance of the memory system is stall cycles. Stall cycles account for a significant portion of the execution time of data-intensive applications. Most memory stall cycles come from load misses, and hence reducing load misses has a significant impact on overall performance. Our EPIC machine is an in-order machine, and we assumed a "stall-upon-use" latency model. In this stalling model, a load instruction that causes a cache miss does not immediately block the pipeline; the pipeline is stalled only at the earliest attempt to use the data that is being loaded.
We first built the prediction table for each hotspot of each benchmark using the training input sets and the off-line learning methods. We then ran the simulation again using different input sets and generated load miss traces for level 1 and level 2 cache misses. The training and testing inputs for the experiments are described in Table 1.
For each benchmark, we used the profile information to select the basic blocks where most load misses occurred and designated them as hotspot candidates. To be chosen as a hotspot, a candidate must be separated from neighboring hotspots by a large enough gap. For example, the treeadd benchmark of the Olden pointer benchmark suite comprises 33 basic blocks in total, and load misses occurred in only 11 of them. Moreover, 75% of all load misses came from one particular basic block, basic block number 4 of the treeadd procedure. We chose this basic block as our first hotspot. The next hotspot candidate was block number 6 of the treeadd procedure, from which 20% of all load misses came. However, the average latency between this block and block number 4 was just 388 cycles, which is less than our threshold of 5000 cycles for choosing hotspots; so even though basic block number 6 had the second most load misses, it was not chosen as a hotspot in our experiments. Basic block number 6 of the treealloc procedure was chosen as our second hotspot since its average latency to the first hotspot was 190,826 cycles, ensuring that there is enough time to load the prediction table for this hotspot at runtime. The total number of hotspots ranges from 2 (treeadd) to 19 (130.li), as shown in Table 2, which lists the characteristics of the hotspots in each benchmark.
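The selection heuristic just described can be summarised by the greedy sketch below. The block_distance function, the cap on the number of hotspots, and the data layout are illustrative; only the 5000-cycle minimum gap comes from the text.

```python
def choose_hotspots(block_misses, block_distance, max_hotspots=20,
                    min_gap_cycles=5000):
    """Greedy hotspot selection: take basic blocks in order of miss count,
    but accept a block only if it is far enough from every hotspot chosen
    so far, so that its prediction table can be preloaded in time.

    block_misses   -- dict: basic-block id -> number of load misses
    block_distance -- function(b1, b2) -> average cycles between the blocks
    """
    hotspots = []
    for block, _ in sorted(block_misses.items(), key=lambda kv: -kv[1]):
        if all(block_distance(block, h) >= min_gap_cycles for h in hotspots):
            hotspots.append(block)
        if len(hotspots) == max_hotspots:
            break
    return hotspots
```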
As explained in Section 4.4, we used a stride-based encoding scheme to obtain realistically sized prediction tables. Table 3 shows the result of applying this scheme in our implementation. For each benchmark, we measured the average percentage of each of the four encoding cases after the Hidden Markov and Window Markov predictors finished their learning phases and produced their prediction tables. As shown in Table 3, the Hidden Markov Predictor tends to produce prediction addresses that are farther apart than those of the Window Markov Predictor with window size 5. This eventually leads to fewer prediction addresses in the encoded table, because many addresses that are far away are discarded when the final prediction table is built. The results show that the Window Markov Model not only retains more prediction addresses per miss address in the encoded prediction table, but also has much higher prediction accuracy than the Hidden Markov Model. In Fig. 4, we tested our Window Markov Model with different window sizes; the best result came from window size 5, and performance deteriorated as the window size was increased further. These results strongly suggest that data locality exists even in pointer-intensive applications.

Table 4 gives a detailed breakdown of the performance of the Window Markov Predictor. It shows that the predictor does indeed reduce the overall number of load misses in the applications. Columns 6 and 7 report the coverage of the predictor, i.e. the percentage of load misses in the hotspots that hit the prediction table and caused prefetch requests to be sent out. The last two columns are the ratios of wasted prefetch requests (i.e. mispredictions) per (overall) load miss. We argue that, although we did not simulate actual bus transactions and bandwidth, these ratios indicate that the overhead caused by prefetch requests is low. We attribute this to the good accuracy and coverage of the predictor.

Fig. 5 shows the effect of increasing miss penalties on the various schemes that we tested on the 188.ammp benchmark. In the top diagram, the L2 miss penalty is fixed at 93 cycles, a value obtained from actual measurements reported in [1], while the L1 miss penalty is varied from 12 to 38 cycles. In the lower diagram, an L1 penalty of 25 cycles is assumed while the L2 penalty is varied from 30 to 162 cycles. It is interesting to note that the slope for the Window Markov Predictor is gentler than that of the other schemes. The gap between memory and processor speeds is increasing, resulting in larger miss penalties. The Window Markov Predictor appears more promising than the other schemes in tolerating larger penalties, especially in the L2 cache.

The percentage performance improvements for L1 and L2 prefetching are shown in the normalized graphs of Fig. 6 and Fig. 7, with the base case being a machine with a 32KByte L1 cache and a 256KByte L2 cache that uses no prediction scheme. We measured performance improvement by dividing the total execution cycles with a given prefetching scheme by the total execution cycles without any prefetching. The results show that increasing the L1 cache size does not necessarily improve performance, especially for data-intensive applications that use dynamic, pointer-based data structures. In one instance, a 47% performance improvement was recorded using the Window Markov Predictor. In almost all cases, the use of off-line learning algorithms gave a pronounced performance improvement over simply increasing the cache size or using a hardware prefetch scheme such as RPT.
In particular, the Window Markov Predictor gives the best performance.
Conclusion
In this paper, we proposed a paradigm and architectural framework for the use of off-line learning algorithms in data prefetching. In all the benchmarks that we tested, our off-line learning scheme improved performance more significantly than other schemes such as increasing the cache sizes. The off-line approach allows for even more aggressive analysis and prediction schemes. Our future research seeks to develop more powerful learning modules. Furthermore, we believe that off-line learning can also be adapted to software prefetching, and we are currently exploring this direction.
