For SOR-like PDE solvers, loop tiling either does little to improve data locality or hurts performance. This paper presents a novel compiler technique called code tiling for generating fast tiled codes for these solvers on uniprocessors with a memory hierarchy. Code tiling combines loop tiling with a new array layout transformation called data tiling in such a way that a significant number of the cache misses that would otherwise be present in tiled codes are eliminated. Compared to nine existing loop tiling algorithms, our technique delivers impressive performance speedups (faster by factors of 1.55 to 2.62) and smooth performance curves across a range of problem sizes on representative machine architectures. The synergy of loop tiling and data tiling allows us to find a tile size that minimises a cache miss objective function independently of the problem size parameters. This "one-size-fits-all" scheme makes our approach attractive for designing fast SOR solvers without having to generate a multitude of versions specialised for different problem sizes.
Introduction
As the disparity between processor and memory speeds continues to increase, the importance of effectively utilising caches is widely recognised. Loop tiling (or blocking) is probably the most well-known loop transformation for improving data locality. This transformation divides the iteration space of a loop nest into uniform tiles (or blocks) and schedules the tiles for execution atomically. Under an appropriate choice of tile sizes, loop tiling often improves the execution times of array-dominated loop nests.
However, loop tiling is known not to be very useful (or is even considered unnecessary [15]) for 2D PDE (partial differential equation) solvers. In addition, tile size selection algorithms [5, 6, 12, 14, 18] target only the 2D arrays accessed in tiled codes. To address these limitations, Song and Li [16] propose a new tiling technique for handling 2D Jacobi solvers. But their technique does not apply to SOR (Successive Over-Relaxation) PDE solvers. Rivera and Tseng [15] apply loop tiling and padding to tile 3D PDE codes. However, they do not exploit a large amount of the temporal reuse carried by the outermost time loop. In this paper, we present a new technique for improving the cache performance of a class of loop nests, which includes multidimensional SOR PDE solvers as a special case.
† This work is supported by an ARC Grant A10007149.
‡ The author was performing part of his PhD studies at UNSW when this work was carried out. He was also supported by the same ARC grant.
Our compiler technique, called code tiling, emphasises the joint restructuring of the control flow of a loop nest through loop tiling and of the data it uses through a new array layout transformation called data tiling. While loop tiling is effective in reducing capacity misses, data tiling reorganises the data in memory by taking into account both the cache parameters and the data access patterns in tiled code. By taking control of the mapping of data to memory, we can reduce the number of capacity and conflict misses (which are referred to collectively as replacement misses) methodically. In the case of SOR-like PDE solvers assuming a direct-mapped cache, our approach guarantees the absence of replacement misses in every two consecutively executed tiles in the sense that no memory line will be evicted from the cache if it will still be accessed in the two tiles (Theorems 4 and 6). Furthermore, this property carries over to the tiled code we generate for 2D SOR during the computation of all the tiles in a single execution of the innermost tile loop (Theorem 5). Existing tile size algorithms [5, 6, 12, 14, 18] cannot guarantee this property.
The synergy of loop tiling and data tiling allows us to find a problem-size-independent tile size that minimises a cache miss objective function independently of the problem size parameters. This "one-size-fits-all" scheme makes our approach attractive for designing fast SOR solvers for a given cache configuration without having to generate a multitude of versions specialised for different problem sizes.
We have evaluated code tiling for a 2D SOR solver on four representative architectures. In comparison with nine published loop tiling algorithms, our tiled codes have fewer cache misses, higher performance (faster by factors of 1.55 to 2.62), and smooth performance curves across a range of problem sizes. In fact, code tiling has succeeded in eliminating a significant number of the cache misses that would otherwise be present in tiled codes.
The rest of this paper is organised as follows. Section 2 defines our cache model. Section 3 introduces our program model and gives a high-level view of our code tiling strategy. Section 4 describes how to construct a data tiling transformation automatically. Section 5 focuses on finding optimal problem-size-independent tile sizes. Section 6 discusses performance results. Section 7 reviews related work. Section 8 concludes and discusses some future work.
Cache Model
In this paper, a data cache is modeled by three parameters: C denotes its size, L its line size and K its associativity. C and L are in array elements unless otherwise specified. Sometimes a cache configuration is specified as a triple (C, L, K). In addition, we assume a fetch-on-write policy so that reads and writes are not distinguished.
Definition 1 (Memory and Cache Lines)
A memory line refers to a cache-line-sized block in the memory while a cache line refers to the actual block to which a memory line is mapped.
From an architectural standpoint, cache misses fall into one of three categories: cold, capacity, and conflict. In this paper, cold misses are used as before but capacity and conflict misses are combined and called replacement misses.
Code Tiling
We consider the following program model:
where $I = (I_1, \ldots, I_m)$ is known as the iteration vector, $M$ is an $n \times m$ integer matrix, the loop bounds $p_k$ and $q_k$ are affine expressions of the outer loop variables $I_1, \ldots, I_{k-1}$, the vectors $c_1, \ldots, c_\eta$ are offset integer vectors of length $n$, and $f$ symbolises some arbitrary computation on the $\eta$ array references. Thus, $A$ is an $n$-dimensional array accessed in the loop nest. In this paper, all arrays are in row major. As is customary, the set of all iterations executed in the loop nest is known as the iteration space of the loop nest:
$$S = \{ I = (I_1, \ldots, I_m) : p_k \le I_k \le q_k,\ k = 1, \ldots, m \} \qquad (2)$$
This program model is sufficiently general to include multi-dimensional SOR solvers. Figure 1 depicts a 2D version, where the t loop is called the time loop, whose loop variable does not appear in the subscript expressions of the references in the loop body. In addition, the linear parts $M$ of all subscript expressions are the identity matrix, and the offset vectors $c_1, \ldots, c_\eta$ contain entries drawn from $\{-1, 0, 1\}$. These solvers are known as stencil codes because they compute values using neighbouring array elements in a fixed stencil pattern. The stencil pattern of data accesses is repeated for each element of the array.
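Since Figure 1 is not reproduced here, the following is a minimal Python sketch of what a 2D SOR kernel of this shape might look like. The five-point stencil, the relaxation factor `omega` and the name `sor_2d` are assumptions made for illustration; they are not taken from the paper's code.

```python
def sor_2d(A, P, omega=1.0):
    """Run P in-place relaxation sweeps over the interior of a square grid A.

    Hypothetical illustration: a textbook five-point SOR update.
    """
    n = len(A)
    for t in range(P):                  # time loop: t never appears in a subscript
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # Offsets are drawn from {-1, 0, 1}; the linear part is the identity.
                avg = 0.25 * (A[i - 1][j] + A[i + 1][j] + A[i][j - 1] + A[i][j + 1])
                A[i][j] = (1.0 - omega) * A[i][j] + omega * avg
    return A
```

Because the update is in place, each sweep reads values produced both in the current sweep and in the previous one, which is the source of the temporal reuse carried by the time loop.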
Without loss of generality, we assume that the program given in (1) can be tiled legally by rectangular tiles [19]. For the 2D SOR code, tiling only the inner two loops is not beneficial since a large amount of temporal reuse carried by the time loop is not exploited. Due to the existence of the dependence vectors (1, −1, 0) and (1, 0, −1), tiling all three loops by rectangles would be illegal [19]. Instead, we skew the iteration space by using a linear transformation (Figure 2). We choose to move the time step inside because a large amount of temporal reuse in the time step can be exploited for large P.
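The legality argument can be checked mechanically. The sketch below is only an illustration under stated assumptions: the transformation matrix is not reproduced in this text, so `SKEW_THEN_PERMUTE` is one common choice, mapping (t, i, j) to (i + t, j + t, t). The dependence set also adds the same-sweep and self dependences of an in-place five-point stencil to the two vectors quoted above. The test used is the usual sufficient condition that every dependence component be non-negative.

```python
# Dependence vectors in (t, i, j) order. The first two are quoted in the text;
# the remaining three are assumed for an in-place five-point stencil.
DEPS = [(1, -1, 0), (1, 0, -1), (0, 1, 0), (0, 0, 1), (1, 0, 0)]

# Assumed transformation: skew i and j by t, then move t innermost.
SKEW_THEN_PERMUTE = [(1, 1, 0), (1, 0, 1), (1, 0, 0)]

def apply(matrix, vec):
    """Multiply an integer matrix (list of rows) by a vector."""
    return tuple(sum(a * b for a, b in zip(row, vec)) for row in matrix)

def rect_tiling_legal(deps):
    """Sufficient condition: all components of all dependences are >= 0."""
    return all(all(x >= 0 for x in d) for d in deps)

skewed_deps = [apply(SKEW_THEN_PERMUTE, d) for d in DEPS]
```

Under these assumptions `rect_tiling_legal(DEPS)` is false while `rect_tiling_legal(skewed_deps)` is true, which matches the claim that rectangular tiling only becomes legal after skewing.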
Loop tiling can be understood as a mapping from the iteration space $S$ to $\mathbb{Z}^{2m}$ such that each iteration $(I_1, \ldots, I_m) \in S$ is mapped to a new point in $\mathbb{Z}^{2m}$ [7, 19]: According to this definition, there are four kinds of cache misses in tiled code: cold, intra-tile, inter$_1$-tile and inter$_2$-tile.
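The tiling map mentioned at the start of this passage is not reproduced in this text. As a hedged illustration, the sketch below uses the common textbook form in which an iteration is sent to its tile coordinates followed by its original coordinates, giving a point with 2m components. The name `tile_map` and this exact form are assumptions.

```python
def tile_map(I, T):
    """Map iteration I to (tile coordinates, original coordinates).

    Assumed form: tile coordinate k is floor(I_k / T_k). Python's // is a
    floor division, so negative indices are handled consistently.
    """
    tile = tuple(i // t for i, t in zip(I, T))
    return tile + tuple(I)

def iterations_by_tile(iterations, T):
    """Group iterations by the tile they fall into."""
    tiles = {}
    for I in iterations:
        tiles.setdefault(tile_map(I, T)[:len(T)], []).append(I)
    return tiles
```

For example, with T = (2, 2, 2) the iteration (5, 7, 3) lands in tile (2, 3, 1).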
Let $u$ and $u'$ be two adjacent tiles. If a tiled loop nest is free of intra- and inter$_1$-tile misses in both tiles, then no memory line will be evicted from the cache during their execution if it will still be accessed in $u$ and $u'$, and conversely.
Figure 3 gives a high-level view of code tiling for direct-mapped caches. In Step (b), we construct a data tiling g to map the n-dimensional array A to the 1D array B such that the tiled code operating on B is free of intra- and inter$_1$-tile misses (Definition 4). There can be many choices for such tile sizes. In Step (c), we choose the one such that the number of inter$_2$-tile misses is minimised. The optimal tile size found is independent of the problem size because our cost function is (Section 5). Finally, our construction of g ensures that the number of cold misses in the tiled code has only a moderate increase (due to data remapping) with respect to that in the original program.
Definition 4 (Data Tiling) The mapping g (see Figure 3) is called a data tiling if the tiled code given is free of intra- and inter$_1$-tile misses.
For a K-way set-associative cache (C, L, K), where K > 1, we treat the cache as if it were the direct-mapped cache $(\frac{K-1}{K}C, L, 1)$. As far as this hypothetical cache is concerned, g used in the tiled code is a data tiling transformation. By using the effective cache size $\frac{K-1}{K}C$ to model the impact of associativity on cache misses, we are still able to eliminate all intra-tile misses for the physical cache (Theorem 3). In the special case when K = 2, the cache may be underutilised since the effective cache size is only C/2. Instead, we will treat the cache as if it were (C, L, 1). The effectiveness of our approach has been validated by extensive experiments conducted (only) on set-associative caches.
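The rule above for choosing the hypothetical direct-mapped cache can be written as a small helper. This is a sketch only: the name `effective_direct_mapped` is hypothetical, and the integer division assumes that C is a multiple of K.

```python
def effective_direct_mapped(C, L, K):
    """Return the direct-mapped configuration used to model a (C, L, K) cache.

    K == 1: already direct-mapped.
    K == 2: special case in the text, the full size C is kept.
    K >  2: effective size is (K - 1) / K * C (assumes K divides C).
    """
    if K in (1, 2):
        return (C, L, 1)
    return ((K - 1) * C // K, L, 1)
```

With sizes in array elements, a 4-way cache of 4096 elements would therefore be modelled as a direct-mapped cache of 3072 elements.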
Data Tiling
In this section, we present an algorithm for automating the construction of data tiling transformations. Throughout this section, we assume a direct-mapped cache, denoted by (C, L, 1), where the cache size C and the line size L are both in array elements. An application of the results in this section for set-associative caches is discussed in Section 3.
We will focus on a loop nest that conforms to the program model defined in (1) with the iteration space $S$ given in (2). We denote by offset(A) the set of the offset vectors of all $\eta$ array references to A, i.e., $\mathrm{offset}(A) = \{c_1, \ldots, c_\eta\}$. The notation $e_i$ denotes the $i$-th elementary vector whose $i$-th component is 1 and all the rest are 0.
Recall that a loop tiling is a mapping as defined in (3) and that $T = (T_1, \ldots, T_m)$ denotes the tile size used. Let $S_T$ be the set of all the tiles obtained for the program:
Let $T(u)$ be the set of all iterations contained in the tile $u$:
In this definition, the constraint $I \in S$ from (4) is omitted. Thus, the effect of the iteration space boundaries on $T(u)$ is ignored. As a result, $|T(u)|$ is invariant with respect to $u$. For notational convenience, the operator mod is used as both an infix and a prefix operator. We do not distinguish whether a vector is a row or column vector and assume that this is deducible from the context.
Let addr be a memory address. In a direct-mapped cache (C, L, 1), the address resides in the memory line $\lfloor addr/L \rfloor$ and is mapped to the cache line $\mathrm{mod}(\lfloor addr/L \rfloor, C/L)$.
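As a small illustration of this address mapping, the sketch below computes the memory line and the cache line of an address, with all quantities in array elements. The function names are hypothetical.

```python
def memory_line(addr, L):
    """Index of the memory line holding addr (line size L, in elements)."""
    return addr // L

def cache_line(addr, C, L):
    """Cache line that addr maps to in a direct-mapped cache (C, L, 1)."""
    return memory_line(addr, L) % (C // L)
```

With C = 30 and L = 2 there are 15 cache lines, and addresses 0 and 30 compete for the same one. This is the kind of coincidence a data tiling arranges deliberately for elements that are never live at the same time.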
In Section 4.1, we give a sufficient condition for a mapping to be a data tiling transformation. In Section 4.2, we motivate our approach by constructing a data tiling transformation for the 2D SOR program. Section 4.3 constructs data tiling transformations for the programs defined in (1).
A Sufficient Condition
For a tile $u \in S_T$, its working set (i.e., the set of elements accessed inside $u$), denoted $D(T(u))$, is given by:
$$D(T(u)) = \{ MI + c : I \in T(u),\ c \in \mathrm{offset}(A) \}$$
It is easy to show that $D(T(u))$ is a translate of $D(T(u'))$ for $u, u' \in S_T$. This property plays an important role in our development, which leads directly to the following result.
Theorem 1 Let $u, u' \in S_T$ be two adjacent tiles, where …
… Figure 2. This theorem implies that the number of elements that are accessed in $u$ but not in $u'$, i.e., $|D(T(u)) \setminus D(T(u'))|$, is exactly the same as the number of elements that are accessed in $u'$ but not in $u$, i.e., $|D(T(u')) \setminus D(T(u))|$. … ) mod C and use the mapping to map the element $A(MI + c)$ to $B(\psi(MI + c))$, where $c \in \mathrm{offset}(A)$, then the two corresponding elements in the two sets will be mapped to the same cache line. By convention, $D(S)$ is the union of $D(T(u))$ for all $u \in S_T$. As a result, the newly accessed data in the set $D(T(u')) \setminus D(T(u))$ when $u'$ is executed will evict from the cache exactly those data in the set $D(T(u)) \setminus D(T(u'))$.
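The translate property and the equal-cardinality consequence discussed above can be checked by brute force on a small case. The sketch below rests on assumptions that should be read as such: the linear part `M` corresponds to a skewed 2D SOR in which iteration (i, j, t) touches array row i − t and column j − t, and `OFFSETS` is a five-point stencil. Neither is copied from the paper's figures.

```python
from itertools import product

M = [(1, 0, -1), (0, 1, -1)]                           # assumed linear part
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]   # assumed stencil offsets

def working_set(u, T):
    """All array indices M*I + c touched by tile u (boundaries ignored)."""
    ws = set()
    ranges = [range(uk * tk, (uk + 1) * tk) for uk, tk in zip(u, T)]
    for I in product(*ranges):
        x, y = (sum(m * v for m, v in zip(row, I)) for row in M)
        for cx, cy in OFFSETS:
            ws.add((x + cx, y + cy))
    return ws

T = (2, 2, 2)
D_u = working_set((0, 0, 0), T)
D_v = working_set((0, 0, 1), T)            # the tile adjacent along the last axis

# Moving one tile along axis 3 shifts every index by M * (T_3 * e_3).
shift = tuple(row[2] * T[2] for row in M)
translated = {(x + shift[0], y + shift[1]) for x, y in D_u}

# Extent of the bounding box of the working set in each array dimension.
extent = tuple(max(p[k] for p in D_u) - min(p[k] for p in D_u) + 1 for k in range(2))
```

In this small case `translated == D_v`, the two set differences have the same size, and the bounding box has extent (T_1 + T_3 + 1, T_2 + T_3 + 1) = (5, 5).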
However, this does not guarantee that all intra- and inter$_1$-tile misses are eliminated. Below we give a condition for data tiling to guarantee these two properties.
For a 1-to-1 mapping g :
holds, where $w_1, w_2 \in W$, then the following must hold:
Proof. Follows from Definition 4 and the definition of g.
Theorem 3
Consider a K-way set-associative cache (C, L, K) with an LRU replacement policy, where
then there are no intra-tile misses in the tiled code from Figure 3.
Proof. For the g given, there can be at most K − 1 distinct memory lines accessed during the execution of any single tile. By Definition 4, there cannot be any intra-tile misses.
In the case of LRU, we tend to reduce also inter$_1$-tile misses by using $\frac{K-1}{K}C$ as the effective cache size.
Constructing a Data Tiling for 2D SOR
In this section, we construct a data tiling transformation to eliminate all intra- and inter$_1$-tile misses for 2D SOR. We will continue to use the example given in Figure 4. Since the array A is stored in row major, the elements depicted in the same row are stored consecutively in memory. In Step (b) of Figure 3, we will construct a data tiling g to map A to B such that the elements of B will reside in the memory and cache lines as illustrated in Figure 5. (It should be pointed out that g is not a block-cyclic array layout transformation.)
The basic idea is to divide the set of all elements of A into equivalence classes such that $A(i, j)$ and $A(i', j')$ are in the same class if $i = i' \bmod (T_1 + T_3 + 1)$ and $j = j' \bmod (T_2 + T_3 + L)$. For all array elements of A in the same equivalence class, we will construct g such that their corresponding elements in the 1D array B have the same memory address (modulo C). In other words, $A(i, j)$ and $A(i', j')$ are in the same class iff $g(i, j) = g(i', j') \bmod C$. In Figure 5, the two elements of A connected by an arc are mapped to the same cache line. This ensures essentially that the unused elements that are accessed in a tile will be replaced in the cache by the newly accessed elements in its adjacent tile to be executed next. As mentioned earlier, this does not guarantee the absence of intra- and inter$_1$-tile misses. To eliminate them, we must impose some restrictions on $(T_1, T_2, T_3)$. For example, a tile size that is larger than the cache size will usually induce intra-tile misses.
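The classification described above can be sketched as follows. This only computes which class an element belongs to; it is not the mapping g of (9), which is not reproduced in this text, and the name `class_of` is hypothetical.

```python
def class_of(i, j, T1, T2, T3, L):
    """Equivalence class of A(i, j): indices reduced modulo the two periods."""
    return (i % (T1 + T3 + 1), j % (T2 + T3 + L))

def count_classes(n, T1, T2, T3, L):
    """Number of distinct classes met on an n x n index range."""
    return len({class_of(i, j, T1, T2, T3, L) for i in range(n) for j in range(n)})
```

For T_1 = T_2 = T_3 = L = 2 the periods are 5 and 6, so there are 5 × 6 = 30 classes. A data tiling would then place all elements of one class at the same address modulo C.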
In the 2D SOR program given in Figure 2 , the linear part of an array reference is defined as follows:
Let $u = (ii, jj, tt)$, $u' = (ii, jj, tt + 1) \in S_T$ be two adjacent tiles. $T(u)$ and $T(u')$ are defined according to (5).
By Definition 4, it suffices to find a (C, L)-1-to-1 mapping on $S_T$. To do so, we need a 2-parallelotope con… 0). This parallelotope can be obtained by our algorithms FindFacets and FindQ given in Appendix A. We denote by $(G, F(u), K)$ this parallelotope, where … ) and the first and second rows of $G$ are $G_1$ and $G_2$, respectively. According to $(G, F(u), K)$, we classify the points in the data space (i.e., the set of array indices of A) and find a mapping such that the points in the same class are mapped into the same cache line. We say that two points $(i, j)$ and $(i', j')$ … , for some integers $s$ and $t$. Two points are in the same class iff they are equivalent. (For example, the two points connected by an arc in Figure 5 are in the same equivalence class.) Let
Theorem 4 Let a direct-mapped cache (C, L, 1) be given. Then g defined in (9) is a data tiling transformation for the 2D SOR if $(T_1, T_2, T_3)$ satisfies the following two conditions: 1. $L$ divides both $T_2$ and $T_3$, and 2. (T
Proof. See Appendix A.
Therefore, g in (9) for 2D SOR guarantees that the tiled code for the program is free of intra- and inter$_1$-tile misses provided the conditions in Theorem 4 are satisfied.
In the example illustrated in Figures 4 and 5, we have $T_1 = T_2 = T_3 = L = 2$ and $C = 30$. Both conditions in Theorem 4 are true. The resulting data tiling can be obtained by substituting these values into (9).
In fact, our g has eliminated all inter$_2$-tile misses among the tiles in a single execution of the innermost tile loop.
Theorem 5 Under the same assumptions as Theorem 4, g defined in (9) ensures that during any single execution of the innermost tile loop, every memory line, once evicted from the cache, will not be accessed during the rest of the execution.
Proof. See Appendix A.
Constructing a Data Tiling for (1)
We now give a data tiling, denoted g, for a program of the form (1). This time we need an r-parallelotope [17] that contains $D(T(u))$, where $r$ is the dimension of the affine hull of $D(T(u))$. This parallelotope, denoted $P(T(u))$, is found by our algorithm FindFacets. We can see that $P(T(u))$ is the smallest r-parallelotope containing $D(T(u))$ if the components of the offset vectors in offset(A) are all from $\{-1, 0, 1\}$. Therefore, it is only necessary to map the elements of A that are accessed in the loop nest to B. Hence, g is a mapping from $\mathbb{Z}^r$ to $\mathbb{Z}$, where $r \le n$.
Let $\varphi(T)$ and $\psi(T)$ be the number of elements contained in $D(T(u))$ and $P(T(u))$, respectively. From now on we assume that a tile fits into the cache, i.e., $\psi(T) \le C$. Let $P(T(u)) = (G, F(u), K)$ be found by FindFacets and $Q$ by FindQ. Without loss of generality, we assume that $0 \in D(S)$ and GQS
We call GQS the data tile space. For the 2D SOR example, we have $r = 2$, $Q = \mathrm{diag}(T_1 + T_3 + 1,\ T_2 + T_3 + 1)$ and
… $(z_1(I), \ldots, z_r(I)) = GI - Qy(I)$.
Let $v = (v_1, \ldots, v_r)$ and
Let
where $\prod_{k=r+1}^{r} (UB_k - LB_k) = 0$.
Theorem 6 Let a direct-mapped cache (C, L, 1) be given. Then g defined in (11) is a data tiling transformation for (1) if the following two conditions are true:
Proof. Under the given two conditions, g is (C, L)-1-to-1 on $S_T$. By Theorem 2, g is a data tiling as desired.
Finding Optimal Tile Sizes
Let a loop nest of the form (1) be given, where A is the array accessed in the nest. Let this loop nest be tiled by the tile size $T = (T_1, \ldots, T_m)$. Let $\tilde{T} = (T_1, \ldots, T_{m-1}, 2T_m)$. Using the notation introduced in Section 4.3, $\varphi(T)$ represents the number of distinct array elements accessed in a tile and $\varphi(\tilde{T})$ the number of distinct array elements accessed in two adjacent tiles. Thus, $\varphi(\tilde{T}) - \varphi(T)$ represents the number of new array elements accessed when we move from one tile to its adjacent tile to be executed next.
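The quantity "new elements per tile step" can be evaluated by direct enumeration on small tile sizes. The sketch below is illustrative only and relies on assumptions: the linear part `M` and the five-point `OFFSETS` model a skewed 2D SOR and are not taken from the paper's figures, and the names `phi` and `new_elements` are hypothetical.

```python
from itertools import product

M = [(1, 0, -1), (0, 1, -1)]                           # assumed linear part
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]   # assumed stencil offsets

def phi(T):
    """Number of distinct array elements touched by the tile at the origin."""
    elems = set()
    for I in product(*(range(t) for t in T)):
        x, y = (sum(m * v for m, v in zip(row, I)) for row in M)
        elems.update((x + cx, y + cy) for cx, cy in OFFSETS)
    return len(elems)

def new_elements(T):
    """phi(T~) - phi(T), where T~ doubles the last tile dimension."""
    T_tilde = tuple(T[:-1]) + (2 * T[-1],)
    return phi(T_tilde) - phi(T)
```

For the degenerate tile (1, 1, 1) a single iteration touches the 5 stencil elements, and stepping to the next tile brings in 3 new ones under these assumptions.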
Our cost function is given as follows:
For each tile size that induces no intra- and inter$_1$-tile misses under data tiling, the number of cache misses (consisting of cold and inter$_2$-tile misses) in the tiled code is dominated by $|S_T|/f(T)$. Hence, the optimal tile size is a maximal point of $f$ such that the conditions in Theorem 6 (or those in Theorem 4 for 2D SOR) are satisfied. Of all tile sizes without intra- and inter$_1$-tile misses, we therefore take the one such that the number of inter$_2$-tile misses is minimised. Hence, the total number of cache misses is minimised. The set of all tile sizes is $\{(T_1, \ldots, T_m) : 1 \le T_1, \ldots, T_m \le C\}$. The optimal one can be found efficiently by an exhaustive search with a worst-case time complexity of $O(C^m)$, where C is the cache size in array elements (rather than bytes). (The worst-case time complexity when m = 2 can be tightened to $O(C \log C)$.) Essentially, we simply go through all tile sizes that satisfy the conditions mentioned above and pick the one that is a maximal point of $f$.
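The exhaustive search just described has a simple shape, sketched below. Since neither the cost function f nor the full feasibility conditions are reproduced in this text, both are passed in as callables. The name `best_tile_size` and the toy `feasible` and `f` used in the example are hypothetical stand-ins, not the paper's definitions.

```python
from itertools import product

def best_tile_size(C, m, feasible, f):
    """Scan all T in {1..C}^m, keep feasible ones, return a maximiser of f.

    Worst-case cost is O(C^m) evaluations, as stated in the text.
    Ties are resolved in favour of the first tile size encountered.
    """
    best, best_val = None, None
    for T in product(range(1, C + 1), repeat=m):
        if not feasible(T):
            continue
        val = f(T)
        if best_val is None or val > best_val:
            best, best_val = T, val
    return best

# Toy stand-ins (NOT the paper's conditions or cost function):
toy_feasible = lambda T: T[0] * T[1] <= 8          # "tile fits in cache"
toy_f = lambda T: T[0] * T[1] / (T[0] + T[1])      # "work per new element"
toy_best = best_tile_size(8, 2, toy_feasible, toy_f)
```

Because the search depends only on the cache parameters and not on the problem size, it needs to be run once per cache configuration.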
Next we provide a characterisation of cache misses for a program p of the form (1) when L = 1; it can be generalised to the case when L > 1. Let OMN(T) be the smallest among the cache miss numbers of all tiled codes for p obtained using the traditional loop tiling under a fixed T but all possible array layouts of A. Let DTMN(T) be the cache miss number of the tiled code for p we generate when the layout of A is defined by a data tiling transformation.
Theorem 7 Let a direct-mapped cache (C, 1, 1) be given. Assume that $a_1 > 0, \ldots, a_m > 0$ are constants and the iteration space of (1) …
Proof. When L = 1, we have the two inequalities:
which together imply the inequality in the theorem.
This theorem implies that when N is large and if we choose T such that f (T ) > f(T ), then the number of cache misses in our tiled code is smaller than that obtained by loop tiling regardless of what array layout is used for the array A.
Experimental Results
We evaluate code tiling using the 2D SOR solver and compare its effectiveness with nine loop tiling algorithms on the four platforms as described in Table 1 . In all our experiments, the 2D SOR is tiled only for the first level data cache in each platform.
All "algorithms" considered in our experiments are referred to by the following names: seq denotes the sequential program, cot denotes code tiling, lrw is from [18] , tss from [5] , ess from [6] , euc from [14] , pdat from [12] , and pxyz is the padded version of xyz with pads of 0-8 elements (the same upper bound used as in [14] ).
Our tiled code is generated according to Figure 3. The program after its Step (a) is given in Figure 6. The data tiling function g required in Step (b) is constructed according to (9). The problem-size-independent tile sizes on the four platforms are found in Step (c) according to Section 5 and listed in Table 2. Note that on all platforms, the optimal tile size satisfies $T_3 = L$. The final tiled code obtained in Step (d) is optimised as described in Step (e). Note that $T_3$ does not appear in the tiled code given in Figure 6 since the two corresponding loops are combined by loop coalescing.
Table 2. Problem-size-independent tile sizes. … (50, 60, 4); Alpha 21264: (76, 80, 8).
All programs are in ANSI C, compiled and executed on the four platforms as described in Table 1. The last two platforms are SGI Origin 2000 and Compaq ES40 with multiple processors. We used only a single processor during our experiments. All our experiments were conducted when we were the only user on these systems.
The SOR kernel has two problem size parameters P and N. In all our experiments except the one discussed in Figure 11, we fix P = 500 and choose N from 400 to 1200 at multiples of 57.
Figure 7 shows the performance results on Pentium III. Figure 7(a) plots the individual execution times, showing that all tiled codes run faster than the sequential program except for ess at the larger problem sizes. But our tiled codes perform the best at all problem sizes (represented by the curve at the bottom). Figure 7(b) highlights the overall speedups of all tiled codes over the sequential program. This implies that code tiling is faster by factors of 1.98 to 2.62 over the other tiling algorithms, as shown in Figure 7(c).
Figure 8 shows the performance results on Pentium 4. This time, however, loop tiling is not useful as shown in Figure 8(a). Figure 8(b) indicates that none of the existing tiling algorithms yields a positive performance gain but code tiling attains a speedup of 1.56. Figure 8(c) shows that code tiling is faster than these algorithms by factors of 1.56 to 1.59.
Figure 6. Tiled 2D SOR code.
Some other properties of code tiling are in order.
Copy Cost. All execution times include the copy overheads. In the tiled code for 2D SOR, the copy cost contributes only $O(1/P)$ to the overall time complexity, where P is the number of time steps. We measured the copy cost to be 0.8% to 1.2% on Pentium III, 0.1% to 1.5% on Pentium 4, 0.1% to 1.0% on R10K and 0.1% to 1.3% on Alpha 21264 of the total execution time.
Address Calculation Cost. The data tiling functions used involve integer division and remainder operations and are thus expensive. They are efficiently computed by using incremental additions/subtractions and distributing loop nests to avoid excessive max/min operations.
… where C is the cache size in terms of array elements rather than bytes. For the problem sizes used in our experiments, $g(N, N)$ ranges from $1.03N^2$ to $1.16N^2$. Note that the multiplier is only 1.33 when N = 100. When N is even smaller, tiling is usually not needed. The tiling technique for the Jacobi solvers [16] employs array duplication to remove anti and output dependences. So their constant multiplier is 1.
Cache Misses. To support our claim that code tiling has eliminated a large number of the cache misses present in the tiled codes generated by loop tiling, we evaluated cache performance for all codes involved in our experiments using PCL [13]. Figure 12 plots the real L1 data cache misses for all methods on Pentium III. In comparison with Figure 7(a), the significant performance gains correlate well with the significant cache miss reductions at most problem sizes. Note that lrw has comparable or even smaller cache miss numbers at some problem sizes. This is because in our tiled codes, some temporaries are required to enable incremental computation of the data tiling function (see Step (d) in Figure 3) and they are not all kept in registers due to the small number of registers available on the x86 architecture. Despite this problem, cot outperforms lrw at all problem sizes. This can be attributed to several reasons (e.g., TLB and L2 misses).
Related Work
To the best of our knowledge, no previous work applies a global data reorganisation strategy to minimise the cache misses in tiled codes. Some earlier attempts at partitioning the cache and mapping arrays into distinct cache partitions can be found in [2, 10]. Manjikian et al [10] allocate arrays to equal-sized regions. Chang et al [2] allow varying-sized regions but assume all arrays to be one-dimensional. These techniques cannot handle the 2D SOR solvers since these kernels each use one single array; there is nothing to partition.
Compiler researchers have applied loop tiling to enhance data locality. Several tile size selection algorithms [5, 6, 12, 14, 18] find tile sizes to reduce the cache misses in tiled codes. Since these algorithms rely on the default linear layouts of the arrays, padding has been incorporated by many algorithms to help loop tiling stabilise its effectiveness [12, 14] .
While promising performance gains in many programs, loop tiling is not very useful for 2D PDE solvers and may even worsen their performance as shown by our experiments. In recognising this limitation, Song and Li [16] present a tiling technique for handling 2D Jacobi solvers. This paper contributes a new technique for improving the performance of multi-dimensional SOR solvers. Rivera and Tseng [15] extend their previous work [14] to 3D solvers but they do not exploit a large amount of temporal reuse carried at the time step as we do here.
Kodukula et al [9] propose a data shackling technique to tile imperfect loop nests. But this technique itself does not tell which tile size to use. Like loop tiling, data shackling is a loop transformation. As such, it does not modify the actual layouts of the arrays used in tiled codes.
Chatterjee et al [3] consider nonlinear array layouts and achieve impressive performance speedups in some benchmarks when they are combined with loop tiling. However, their technique is orthogonal to loop tiling; they rely on a tile size selection algorithm to find appropriate tile sizes. In addition, they choose nonlinear layouts for all the arrays without making any attempt at partitioning and reorganising them in memory. In other words, they do not directly aim at reducing the cache misses in tiled codes. This may partially explain why they obtain increased cache misses in some benchmarks (due to conflict misses between tiles).
The importance of combining data transformations with loop transformations was recognised earlier [4] . Subsequently, several researchers [8, 11] permit the co-existence of different array layouts (row major, column major, diagonal or others) in a kernel or program-wise and obtain moderate performance gains for benchmark programs.
The PhiPAC project [1] uses an exhaustive search to produce highly tuned tiled codes for specific level-3 BLAS kernels, which are specialised not only for a given cache configuration but also a given problem size. Our code tiling methodology generates automatically a single "optimised" version for an SOR PDE solver for all problem sizes.
Conclusion
We have presented a new compiler technique for improving the performance of a class of programs that includes multi-dimensional SOR PDE solvers as a special case. Code tiling combines loop tiling with data tiling in order to reduce cache misses in a predictable and methodical manner. We have evaluated its effectiveness using the classic 2D SOR solver, for which loop tiling is ineffective, on four representative architectures. Our experimental results show that code tiling has eliminated a significant number of the cache misses that would otherwise be present in tiled codes. This translates to impressive performance speedups over nine loop tiling algorithms for a range of problem sizes.
We believe that code tiling can be generalised to other programs, at least to dense matrix codes for which loop tiling is an appropriate means of control flow restructuring for data locality. This will have the potential to eliminate a significant amount of conflict misses still present in tiled codes. Some preliminary results we have obtained on matrix multiplication are extremely encouraging.
