Cache-oblivious and cache-aware algorithms have been developed to minimize cache misses. Some of the newest processors have hardware prefetching where cache misses are avoided by predicting ahead of time what memory will be needed in the future and bringing that memory into the cache before it is used. It is shown that hardware prefetching permits the standard Floyd-Warshall algorithm for all-pairs shortest paths to outperform cache-oblivious and cache-aware algorithms. A simple improvement to the standard simple dynamic programming algorithm yields an algorithm that takes advantage of prefetching, and outperforms cache-oblivious and cache-aware algorithms. Finally, it is shown that variants of standard FFT algorithms exhibit good prefetching performance.
Introduction.
The memory subsystem of modern computers is almost universally structured as a hierarchy, with registers at the lowest level followed by the L1 cache, L2 cache, main memory and external memory such as hard disks; memory access time increases quickly from lower levels to higher levels. For the sake of discussion we consider a two-level model that consists of a cache of size M and an arbitrarily large main memory partitioned into blocks of size B. If a requested byte is not stored in the cache, the entire memory block where it resides is brought into the cache, and we call this a cache miss. The I/O complexity of an algorithm is therefore the number of blocks transferred between these two levels upon cache misses. Cache-oblivious algorithms do not use knowledge of M and B, yet still have good cache performance. Cache-aware algorithms, on the other hand, use knowledge of the host machine's M and B to optimize their cache performance. Together, they are called cache-efficient algorithms.
Many cache-efficient algorithms have been developed that outperform the standard algorithms for the same problems. However, a recent development in processor design, hardware prefetching, raises the question of whether these custom cache-efficient algorithms are always needed to reduce cache misses. With hardware prefetching, cache misses are avoided by predicting ahead of time what data in memory will be needed in the future and bringing that data into the cache before it is used. In this paper we explore this question and discover that hardware prefetching can be exploited to yield fast algorithms for the all-pairs shortest paths problem, simple dynamic programming, and the Fast Fourier Transform (FFT).
Hardware Prefetcher in the Pentium 4.
In the Pentium 4 processor, a hardware prefetcher [7] is associated with the L2 cache; it monitors data access patterns and prefetches data automatically into the L2 cache. It attempts to stay 256 bytes ahead of the current data access locations. This prefetcher remembers the history of cache misses to detect concurrent, independent streams of data that it tries to prefetch ahead of use in the program. It follows one stream per 4 KB page (either load or store) and can prefetch up to eight simultaneous independent streams from eight different 4 KB regions. The hardware prefetcher also has a few weaknesses. First of all, it requires rather regular memory access patterns. Moreover, a start-up penalty applies before the hardware prefetcher triggers, and there might be unnecessary fetches after the end of an array is reached. For short arrays this overhead can reduce the effectiveness of the hardware prefetcher.
To understand the range and efficiency of the prefetcher, we timed sequences of array accesses with and without the prefetcher enabled. Prefetcher activation is controlled by setting bits 9 and 19 of the IA32_MISC_ENABLE model-specific register. More information can be found in Appendix B of Volume 3B of the Intel 64 and IA-32 Architectures Software Developer's Manual [6].
This study was performed on a machine running Linux 2.6.16-16 using a 3.4 GHz Pentium 4 processor. The program first allocates a large array and then traverses it ten times, each time reading from and then writing to every n-th byte, where n is the stride length. The results, shown in Figure 1, use an array of forty million bytes with a stride varying from one to five hundred. The normalized time is reported, meaning the stated values are proportional to the time needed per array access. The given measurements are the medians of seven trials. Running this experiment for other large array sizes gave similar results.
With the prefetcher disabled, we expect the normalized time to depend heavily on the number of cache misses. The L1 and L2 caches use blocks of 64 bytes, so for strides of 63 and under, accesses to elements already brought into the cache by previous operations come at a low cost. When the stride length is at least 64, every access touches a new block, so we are effectively measuring the time taken (without normalization) for $10 \cdot (40 \cdot 10^6)/n$ cache misses. With the prefetcher enabled and n ≤ 256, elements of the array are brought into the L2 cache ahead of use. This gives us some improvement, since many elements that would have been drawn from the main memory are instead pulled from the L2 cache. However, the hardware prefetcher requires a few initial misses before it can start prefetching, and it only prefetches from main memory into the L2 cache [7]. Furthermore, the overhead required by the prefetcher actually slows down the array accesses for large strides, which are out of range of the prefetcher. These results suggest that hardware prefetching can give significant speedup with sequential accesses to memory that are close together, but that prefetching can actually slow down accesses that are spaced far apart.
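The miss count used above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (not the C++ of the experiments), assuming the array starts on a block boundary and is far too large to survive in the cache between passes; the function names are hypothetical.

```python
import math

LINE_BYTES = 64  # block size of the L1 and L2 caches, as stated in the text


def misses_per_pass(array_bytes, stride, line_bytes=LINE_BYTES):
    """Blocks touched when visiting every stride-th byte of the array once.

    Assumes the array is block-aligned and nothing survives in the cache
    from one pass to the next (prefetcher disabled).
    """
    if stride >= line_bytes:
        # Consecutive accesses are at least one block apart: each one misses.
        return math.ceil(array_bytes / stride)
    # Several accesses share a block: one miss per block of the array.
    return math.ceil(array_bytes / line_bytes)


def total_misses(array_bytes, stride, passes=10):
    """Misses over the ten traversals used in the experiment."""
    return passes * misses_per_pass(array_bytes, stride)


if __name__ == "__main__":
    for n in (1, 32, 64, 100, 500):
        print(n, total_misses(40_000_000, n))
```

For strides of 64 and above the count is the number of accesses, which is why the normalized time per access stops improving once the stride reaches the block size.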
3 Cache Efficient Algorithms. Expert programmers have known for many years that reducing the number of cache misses can significantly improve the running time of programs. Much effort has been put into designing cache-efficient versions of various dynamic programming algorithms. These algorithms work by reducing the constant factor in the complexity incurred by the cache misses. One major approach to improving the performance of the cache is to design cache-oblivious algorithms.
The cache-oblivious approach is explored by Frigo et al. in [4], which discusses the cache performance of cache-oblivious algorithms for matrix transpose, FFT and sorting. Park et al. [10] presented a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. Their cache-oblivious algorithm runs roughly 7 times faster than the Floyd-Warshall algorithm on a Pentium 3 machine. Chowdhury et al. [2] gave a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and matrix multiplication, among other problems. Cherng et al. [1] designed, implemented and analyzed new cache-oblivious and cache-aware algorithms for simple dynamic programming based on Valiant's context-free language recognition algorithm, and evaluated them empirically with timing studies and cache simulations.
A major technique in designing cache-aware algorithms is blocking, that is, partitioning the problem into cache-sized subproblems and solving each subproblem while its data is in the cache. In the cache-oblivious techniques, quite often the problem partitions itself naturally into smaller subproblems in a recursive way. Unfortunately, if the recursion is continued all the way to the bottom, then there is a lot of overhead from the recursion. A blocking technique that stops the recursion when the subproblem size reaches the size of the cache (say, the L2 cache) and then solves the subproblem using a standard iterative approach often yields a significantly faster program. On the negative side, the blocking technique only works if the cache size can be communicated to the program.
In the all-pairs shortest paths problem we are given a directed graph G with vertices indexed {1, 2, . . . , n} and for each directed edge (i, j) an associated non-negative cost c(i, j). For each i and j we wish to find the lowest cost of all paths from i to j, where the cost of a path is the sum of the costs of the edges on the path. The Floyd-Warshall algorithm (shown in Algorithm 1) is the standard iterative dynamic programming solution to the all-pairs shortest paths problem [3]. It runs in $O(n^3)$ time and works by looking at paths with successively more and more possible interior vertices until all vertices are exhausted. The work is divided into n iterations.
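The input is the cost matrix X, where X[i, j] = c(i, j) if (i, j) is an edge, X[i, j] = ∞ if (i, j) is not an edge, and X[i, i] = 0. The Gaussian Elimination Paradigm, or GEP, introduced in [2], is a general cache-oblivious framework that covers Gaussian elimination without pivoting, Floyd-Warshall all-pairs shortest paths and matrix multiplication, among other problems. When specialized to the all-pairs shortest paths problem we obtain the recursive formulation described in Algorithm 2. If the input array X has size larger than 1 × 1 then it is subdivided into four equal-size submatrices.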
In the algorithm, the base case is reached when $k_1 = k_2$; the array X is then 1 × 1 and $(i_1, j_1)$ is the index of its one element. All-pairs shortest paths is then solved by calling F(X, 1, n).
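Since Algorithm 2 is not reproduced in this text, the following Python sketch shows one plausible form of the recursion F(X, k_1, k_2). The order of the eight recursive calls (a forward pass over the four quadrants for the first half of the k range, then a backward pass for the second half) is an assumption, as are the function names, the zero-based indices and the restriction to n being a power of two. The plain Floyd-Warshall routine is included only as a reference for checking.

```python
INF = float("inf")


def floyd_warshall(c):
    """Reference iterative algorithm, used here only for comparison."""
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if c[i][k] + c[k][j] < c[i][j]:
                    c[i][j] = c[i][k] + c[k][j]
    return c


def gep_apsp(c):
    """Recursive GEP-style all-pairs shortest paths, updating c in place.

    F(i1, j1, k1, size) handles the square submatrix whose top-left corner
    is (i1, j1) and the intermediate vertices k1 .. k1 + size - 1.
    Assumes len(c) is a power of two.
    """
    def F(i1, j1, k1, size):
        if size == 1:
            # Base case: a 1 x 1 submatrix and a single intermediate vertex.
            via = c[i1][k1] + c[k1][j1]
            if via < c[i1][j1]:
                c[i1][j1] = via
            return
        h = size // 2
        # Forward pass: first half of the k range, quadrants 11, 12, 21, 22.
        F(i1, j1, k1, h)
        F(i1, j1 + h, k1, h)
        F(i1 + h, j1, k1, h)
        F(i1 + h, j1 + h, k1, h)
        # Backward pass: second half of the k range, quadrants 22, 21, 12, 11.
        F(i1 + h, j1 + h, k1 + h, h)
        F(i1 + h, j1, k1 + h, h)
        F(i1, j1 + h, k1 + h, h)
        F(i1, j1, k1 + h, h)

    F(0, 0, 0, len(c))
    return c
```

Stopping the recursion at a block size S and calling `floyd_warshall`-style loops on the block would give a blocked variant in the spirit of the cache-aware algorithms discussed in this section.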
Another cache-oblivious technique is derived from the reduction of path problems to matrix multiplication [5, 9]. For this formulation of matrix multiplication, addition is the min operation and multiplication is the + operation. The recursive step of the Matrix Multiply Paradigm or MMP algorithm can be described elegantly by Algorithm 3. If X is not 1 × 1 then the result $X^*$ is computed recursively by dividing X into submatrices of half the dimension as in the case of the GEP algorithm. The matrix multiply and accumulate operation is also done recursively using divide-and-conquer.
Algorithm 2: GEP Algorithm
Algorithm 3: MMP Algorithm
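To make the (min, +) formulation concrete, here is a minimal Python sketch of the multiply and accumulate operation done "in the standard way", that is, iteratively. The driver that follows it uses plain repeated squaring, which is a swapped-in textbook technique and not the recursive MMP algorithm of Algorithm 3; all names are hypothetical.

```python
INF = float("inf")


def min_plus_accumulate(c, a, b):
    """c[i][j] = min(c[i][j], min over k of a[i][k] + b[k][j]), in place."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            a_ik = a[i][k]
            for j in range(n):
                # "Multiplication" is +, "addition" is min.
                cand = a_ik + b[k][j]
                if cand < c[i][j]:
                    c[i][j] = cand
    return c


def apsp_by_squaring(x):
    """All-pairs shortest paths by repeated (min, +) squaring.

    x is a cost matrix with zero diagonal and INF for missing edges.
    After t squarings the matrix covers all paths of at most 2**t edges.
    """
    n = len(x)
    d = [row[:] for row in x]
    length = 1
    while length < n - 1:
        d = min_plus_accumulate([row[:] for row in d], d, d)
        length *= 2
    return d
```

The i, k, j loop order keeps the innermost loop on rows of `b` and `c`, which are contiguous in a row-major layout.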
Cache-aware algorithms for the all-pairs shortest paths problem can be defined using blocked versions of the cache-oblivious algorithms. The blocked GEP algorithm has a parameter S such that if the subproblem size n ≤ S, then the cost submatrix is computed using the Floyd-Warshall algorithm directly. The blocked MMP algorithm has two parameters S and M. The parameter S is such that if the subproblem size n ≤ S then $X^*$ is computed using the standard iterative dynamic program (the Floyd-Warshall algorithm). The parameter M is such that if n ≤ M then the matrix multiply and accumulate operations are done in the standard way, not with recursive divide-and-conquer.
Experimental Results for All-Pairs Shortest Paths. We implemented the Floyd-Warshall algorithm, the GEP algorithm, the MMP algorithm, the Blocked GEP algorithm and the Blocked MMP algorithm and conducted various running time experiments. In our implementation we chose to store the matrix, whose entries were randomly chosen 4-byte integers, in row-major order, that is, the rows of the matrix are stored in linear memory by storing row 1, then row 2, and so on. These experiments were run under Red Hat Fedora Core 4 on a 2.8 GHz Pentium 4 with 8 KB L1 data cache (4-way associative with 64 B lines) and 512 KB L2 data cache (8-way associative with 64 B lines). The machine has 4 GB of main memory. All algorithms were implemented in C++. The compiler used was g++ 4.0.2 20051125 (Red Hat 4.0.2-8, with optimization -O3). In our studies of the all-pairs shortest paths algorithms the normalized time is the average of ten experiments divided by $n^3$. Figure 2 shows the running time results for the five algorithms on the Pentium 4. The best block sizes for the Blocked GEP algorithm (S = 64) and the Blocked MMP algorithm (S = 64, M = 32) were determined experimentally on a problem size of 2048. In contrast to results from prior studies [10], the Floyd-Warshall algorithm clearly outperforms all the other algorithms. This is certainly an unexpected result because the Floyd-Warshall algorithm does not have the strong temporal locality exhibited by the cache-oblivious and cache-aware algorithms. Therefore, the Pentium 4 must have something that dramatically changes the performance characteristics of the Floyd-Warshall algorithm. Indeed, the Pentium 4 has hardware prefetching, which appears to obviate the need for special algorithms to help cache performance. Figure 3 shows the running time for the five algorithms on the same Pentium 4 machine, only with the hardware prefetcher turned off.
Without hardware prefetching, the Floyd-Warshall algorithm is less than half as fast as the best cache-aware algorithm, the blocked GEP 64. This is a result more consistent with the prior results [10] . Examining the Floyd-Warshall Algorithm (Algorithm 1) closely shows that the inner loop accesses two rows, the i-th and k-th, simultaneously. Thus, we have two access streams with a stride of 4 bytes each, which is very amenable to hardware prefetching.
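The two access streams are easy to see when the matrix is written as one flat row-major array. The sketch below is in Python rather than the C++ used for the experiments, so it only illustrates the index pattern, not the timing; the function name and the flat-list representation are assumptions made for illustration.

```python
INF = float("inf")


def floyd_warshall_flat(x, n):
    """Floyd-Warshall on an n x n matrix stored row-major in the flat list x."""
    for k in range(n):
        row_k = k * n  # start of row k
        for i in range(n):
            row_i = i * n  # start of row i
            x_ik = x[row_i + k]  # fixed during the inner loop
            for j in range(n):
                # Two sequential streams: x[row_i + j] and x[row_k + j].
                # Both advance by one element (4 bytes for 4-byte integers).
                cand = x_ik + x[row_k + j]
                if cand < x[row_i + j]:
                    x[row_i + j] = cand
    return x
```

For example, with edges 0→1 of cost 4, 1→2 of cost 1 and 0→2 of cost 7, the entry for the pair (0, 2) drops from 7 to 5.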
Simple Dynamic Programming.
Another form of dynamic programming is the class of problems called Simple Dynamic Programming in [1]. Input elements $x_1, \dots, x_n$ of a simple dynamic program of size n come from a set X which is the domain of a non-associative semi-ring (U, +, ·, 0), where + is an associative (x + (y + z) = (x + y) + z), commutative (x + y = y + x), idempotent (x + x = x) binary operator and · is a non-associative, non-commutative binary operator. The value 0 is the +-identity (x + 0 = x) and the ·-annihilator (x · 0 = 0 · x = 0). Finally, the operators satisfy the distributive laws (x · (y + z) = x · y + x · z and (y + z) · x = y · x + z · x). The objective is to compute the sum (+) of all ways to generate the product (·) of $x_1, \dots, x_n$ in this order and under all possible groupings for the product.
The simple dynamic programming problem can be solved in $O(n^3)$ time by Algorithm 4, which is the Cocke-Kasami-Younger (CKY) algorithm [8, 13]. Two other iterative algorithms, the Horizontal Algorithm and Diagonal Algorithm, have the subproblems [12] and runs in $O(n^3)$ time. The algorithm, summarized as in [1], has two recursive routines, the plus (+) and the star (⋆) algorithms. Let X be a square matrix of size $n = 2^k$. Unlike the previous algorithms, the input is placed just above the diagonal in X, that is, $X[i, i + 1] = x_i$ for 1 ≤ i < n with the remainder of the array zero. This means the algorithm naturally handles input lengths of a power of two minus one. Arbitrary lengths can be handled by appropriate padding. If n = 2, then $X = X^+$. Otherwise, partition X into sixteen matrices of size $2^{k-1}$ (nine of which are zero).
Then $X^+$, the DP-Closure of X, is shown in Algorithm 5, and $X^\star$ is computed using Valiant's Star algorithm (Algorithm 6). All operations performed can be done in place. An alternative is a cache-aware algorithm called the Blocked Valiant's Algorithm, which chooses two parameters S and M, the first for when to cut off recursive calls to Valiant's Star Algorithm and the second for when to cut off recursive calls to the DP-Closure and matrix multiply and accumulate operations.
Experimental Results for Simple Dynamic Programming. Using the same experimental setup as for the all-pairs shortest paths study (cf. Section 4.2) we implemented the Vertical, Horizontal, and Diagonal Algorithms, and the cache-oblivious and cache-aware algorithms for simple dynamic programming. Figure 4 shows the results of the running time experiments for the five algorithms (S = 64 and M = 32 were the optimal parameters chosen experimentally). Figure 5 shows the same experiments except with the hardware prefetcher turned off. Unfortunately, the standard algorithms did not benefit from the hardware prefetcher as much as the standard algorithm did in the all-pairs shortest paths problem. The cache-efficient algorithms still outperform the standard algorithms. Even though the hardware prefetcher definitely improves the running time of the standard algorithms, it is clearly not enough to counter the impact of the cache misses.
Improving Simple Dynamic Programming.
On careful examination of the Vertical Algorithm (Algorithm 4) it can be seen that in the inner loop, the i-th row and j-th column are accessed simultaneously. For large matrices the accesses in the j-th column have a large stride because of the row-major order layout of the array. Hence, the prefetching hardware of the Pentium 4 is of little help for the column accesses.
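One way to turn the strided column accesses into sequential ones is to keep a second, transposed copy of the matrix, at the cost of twice the memory. The Python sketch below illustrates that idea under stated assumptions: the loop order of `vertical` (columns j left to right, rows i bottom to top) is a reconstruction, since Algorithm 4 is not reproduced in this text, and the transposed copy `Y` is one plausible reading of "data redundant", not necessarily the authors' exact scheme. The semi-ring used to exercise the code is context-free recognition for balanced parentheses; the grammar and all names are hypothetical.

```python
ZERO = frozenset()  # the +-identity and .-annihilator

# Hypothetical grammar in Chomsky normal form for balanced parentheses.
RULES = {("L", "R"): ("S",), ("L", "T"): ("S",),
         ("S", "S"): ("S",), ("S", "R"): ("T",)}


def add(x, y):
    """The + operator: set union (associative, commutative, idempotent)."""
    return x | y


def mul(x, y):
    """The . operator: apply every rule A -> B C with B in x and C in y."""
    return frozenset(a for b in x for c in y for a in RULES.get((b, c), ()))


def tokens(text):
    return [frozenset("L") if ch == "(" else frozenset("R") for ch in text]


def vertical(xs, add, mul, zero):
    """Column-by-column order; the inner loop reads row i and column j of X."""
    n = len(xs) + 1
    X = [[zero] * (n + 1) for _ in range(n + 1)]  # 1-based indices
    for i in range(1, n):
        X[i][i + 1] = xs[i - 1]  # input sits just above the diagonal
    for j in range(2, n + 1):
        for i in range(j - 1, 0, -1):
            acc = X[i][j]
            for k in range(i + 1, j):
                acc = add(acc, mul(X[i][k], X[k][j]))  # X[k][j]: strided
            X[i][j] = acc
    return X[1][n]


def vertical_redundant(xs, add, mul, zero):
    """Same order, but column j of X is read from row j of the copy Y."""
    n = len(xs) + 1
    X = [[zero] * (n + 1) for _ in range(n + 1)]
    Y = [[zero] * (n + 1) for _ in range(n + 1)]  # Y[j][i] mirrors X[i][j]
    for i in range(1, n):
        X[i][i + 1] = xs[i - 1]
        Y[i + 1][i] = xs[i - 1]
    for j in range(2, n + 1):
        row_j = Y[j]
        for i in range(j - 1, 0, -1):
            row_i = X[i]
            acc = row_i[j]
            for k in range(i + 1, j):
                acc = add(acc, mul(row_i[k], row_j[k]))  # both sequential
            X[i][j] = acc
            Y[j][i] = acc  # keep the redundant copy in step
    return X[1][n]
```

Both inner-loop streams now advance by one element per iteration, which is the access pattern the prefetcher handles well in the stride experiment.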
Data redundant algorithms based on the horizontal and diagonal algorithms can be defined similarly. Figure 6 shows the results from implementing the data redundant algorithms. The bottom two curves are from the Data Redundant Vertical and Horizontal Algorithms. On the negative side, if memory is a constraint then the data redundant algorithm, which uses twice as much memory as the standard CKY algorithm, will suffer page faults.
Here $\omega_n$ denotes the nth complex root of unity. We assume that n is a power of two. The first step of the FFT is to rearrange the input array A by taking the bit-reversal permutation [3]. Whether or not the prefetcher was enabled had little effect on the speed of the bit-reversal permutation, so its details are omitted. The remainder of the FFT's computation involves several butterfly operations [3]. Each butterfly operation is just a few steps of complex arithmetic, and only the ordering of the butterfly operations is significant for study under prefetching.
6.1 Prefetcher-Friendly FFT Algorithm. Long sequences of array accesses are desirable for hardware prefetching. In the downwards method, a longer-lasting k loop gives longer sequences of accesses, while in the across method, a long j loop is preferred. This translates into small and large values of m, respectively. We can see some of the benefits of both by having the first few executions of the m loop use the downwards method and then having the remaining executions use the across method. This combination of the two standard approaches, described in Algorithm 10, requires a parameter s. This value is expected to be some power of two specifying how many iterations of the m loop will be done with the downwards method before switching.
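The combination can be sketched as follows in Python. The definitions of the two methods are not reproduced in this text, so the sketch assumes that the downwards method runs the k loop (over blocks of length m) innermost and the across method runs the j loop (within one block) innermost. It also assumes that s acts as a threshold on m, so that stages with m ≤ s use the downwards method. The convention $\omega_m = e^{2\pi i/m}$, the helper names and the `naive_dft` check are likewise assumptions made for illustration.

```python
import cmath


def bit_reverse_permute(a):
    """Return a copy of a with element i moved to the bit-reversed index."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r, v = 0, i
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        out[r] = a[i]
    return out


def twiddle(j, m):
    """omega_m raised to the power j."""
    return cmath.exp(2j * cmath.pi * j / m)


def butterfly(a, lo, hi, w):
    t = w * a[hi]
    u = a[lo]
    a[lo] = u + t
    a[hi] = u - t


def fft_hybrid(a, s):
    """Iterative FFT; stages with m <= s use downwards order, the rest across."""
    n = len(a)  # assumed to be a power of two
    a = bit_reverse_permute(a)
    m = 2
    while m <= n:
        half = m // 2
        if m <= s:
            # Downwards (assumed): k loop innermost, n / m iterations.
            for j in range(half):
                w = twiddle(j, m)
                for k in range(0, n, m):
                    butterfly(a, k + j, k + j + half, w)
        else:
            # Across (assumed): j loop innermost, m / 2 iterations.
            for k in range(0, n, m):
                for j in range(half):
                    butterfly(a, k + j, k + j + half, twiddle(j, m))
        m *= 2
    return a


def naive_dft(a):
    """Quadratic-time reference with the same sign convention."""
    n = len(a)
    return [sum(a[t] * twiddle(k * t, n) for t in range(n)) for k in range(n)]
```

Every choice of s yields the same transform, since the butterflies within one stage are independent of each other; only the order of memory accesses changes.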
6.2 Cache-Efficient FFT Algorithm. Another variant of this algorithm, described in Algorithm 11, is designed to be cache-aware. After applying bit-reversal, the array is divided into subarrays of l elements, and the downwards method is applied to each subarray in turn. Here l is some power of two less than or equal to n. The remainder of the needed butterfly operations are done by the across method operating on the entire array. The appeal of this approach is that, for a well-selected value of l, we can fill the cache (either the L1 or L2) with elements of the array and then perform much of our arithmetic without having cache misses.
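A minimal Python sketch of this two-phase schedule follows. Algorithm 11 itself is not reproduced here, so the details are assumptions: each subarray of l elements is taken through all stages with m ≤ l (k loop innermost) before the next subarray is touched, and the remaining stages sweep the whole array with the j loop innermost. Helper names, the convention $\omega_m = e^{2\pi i/m}$ and the `naive_dft` check are hypothetical.

```python
import cmath


def bit_reverse_permute(a):
    n = len(a)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        r, v = 0, i
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        out[r] = a[i]
    return out


def twiddle(j, m):
    return cmath.exp(2j * cmath.pi * j / m)


def butterfly(a, lo, hi, w):
    t = w * a[hi]
    u = a[lo]
    a[lo] = u + t
    a[hi] = u - t


def fft_blocked(a, l):
    """Cache-aware schedule; n and l are powers of two with l <= n."""
    n = len(a)
    a = bit_reverse_permute(a)
    # Phase 1: finish every stage with m <= l inside one subarray of
    # l elements before moving on, so the subarray can stay in the cache.
    for base in range(0, n, l):
        m = 2
        while m <= l:
            half = m // 2
            for j in range(half):
                w = twiddle(j, m)
                for k in range(base, base + l, m):
                    butterfly(a, k + j, k + j + half, w)
            m *= 2
    # Phase 2: the remaining stages operate on the entire array.
    m = 2 * l
    while m <= n:
        half = m // 2
        for k in range(0, n, m):
            for j in range(half):
                butterfly(a, k + j, k + j + half, twiddle(j, m))
        m *= 2
    return a


def naive_dft(a):
    """Quadratic-time reference with the same sign convention."""
    n = len(a)
    return [sum(a[t] * twiddle(k * t, n) for t in range(n)) for k in range(n)]
```

Phase 1 is valid because a stage with block length m ≤ l never pairs elements from two different subarrays of length l.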
Experimental Results for the FFT. The FFT implementations of Algorithms 10 and 11 were timed with and without the hardware prefetcher enabled using the same experimental setup as in Section 4.2. The arrays had $2^{18}$ randomly generated pairs of floats, with each pair corresponding to real and imaginary parts. Resulting times (in seconds) are multiplied by $10^8/(2^{18} \cdot \log 2^{18})$ for normalization. The results are shown in Figure 7. For the prefetcher-friendly implementation, the x-axis denotes how many of the m loop iterations are done with the downwards method before switching to the across method. Our results suggest that a few iterations of the downwards method followed by iterations of the across method give the fastest times.
As expected, the cache-efficient approach is fastest when we can fill or nearly fill the cache with several elements to be used repeatedly, but is slower when we apply the downwards method to arrays larger than the cache. Here, the x-axis is the number of m loops performed by the downwards method on each subarray (whose sizes also depend on the value of the x coordinate) before the across method is employed. We see a substantial jump in time when we go from 16 m loop iterations using the downwards method to 17. Since 16 iterations use $2^{16}$ pairs of floats, the needed array takes up $2^{16} \cdot 2 \cdot 4$ bytes, which is precisely the size of the L2 cache (512 KB) used in the experiment.
Using either the prefetcher-friendly or the cache-efficient implementation would require tuning, that is, choosing s or l so as to minimize the running time. Without prefetching enabled, the cache-efficient method gives significant improvement over the prefetcher-friendly method. With the prefetcher enabled, however, both methods give nearly the same minimum running times. This suggests that hardware prefetching eliminates some of the need to design cache-efficient implementations.
