A key step in program optimization is the determination of optimal values for code optimization parameters such as cache tile sizes and loop unrolling factors. One approach, which is implemented in most compilers, is to use analytical models to determine these values. The other approach, used in library generators like ATLAS, is to perform a global empirical search over the space of parameter values.
INTRODUCTION
A key step in performance optimization is the determination of optimal values for code optimization parameters like cache tile sizes and loop unrolling factors. One approach, which is implemented in library generators like ATLAS [1] , FFTW [8] and SPIRAL [12] , is to perform an empirical search over a space of possible parameter values, and select the values that give the best performance. An alternative approach, implemented in most compilers, is to use analytical models, parameterized by the values of key hardware parameters such as the capacity of the cache and the number of registers, to estimate optimal parameter values [2, 4, 14] .
In practice, neither approach is completely satisfactory, particularly in the context of compilers. Empirical global search can take a very long time, especially for complex programs and complex architectures, since the size of the optimization space can be very large. To address this problem, some researchers are investigating more advanced search algorithms such as the simplex method [6] , but it remains to be seen if this approach is effective in reducing search time substantially, without reducing the quality of the produced code. Model-driven optimization on the other hand may result in performance penalties of 10-20% even for a relatively simple code like matrix multiplication [17] , which may be unacceptable in some contexts.
In this paper, we explore a different approach that we believe combines the advantages of both approaches. It is based on (i) modelling, (ii) local search, and (iii) model refinement. We advocate the use of models to quickly estimate optimal parameter values (it is important to realize that reducing library generation time is not the focus of our work; rather, it is to find optimization strategies for generating very high-performance code that can be used in generalpurpose compilers, because they scale to large programs and complex architectures). However, all useful models are abstractions of the underlying machine; for example, it is difficult to model conflict misses, so most cache models assume a fully-associative cache. Therefore, it is possible that a particular model may omit some features relevant to performance optimization for some code. To close the performance gap with code produced by exhaustive empirical optimization, we advocate using local search in the neighborhood of the parameter values produced by using the model. Of course local search alone may not be adequate if the model is a poor abstraction of the underlying machine. In that case, we advocate using model refinement -we study the architecture and refine the model appropriately. Intuitively, in our approach, small performance gaps are tackled using local search, while large performance gaps are tackled using model refinement.
The experiments reported in this paper use a modified version of the ATLAS system that implements this methodology. They show that the combination of model refinement and local search can be effective in closing performance gaps between the model-generated code and the code generated by global search, while keeping code generation time small.
The rest of this paper is organized as follows. In Section 2, we describe the optimization parameters used in the ATLAS system, and the global search process used by AT-LAS to find optimal values for these parameters. We also summarize an analytical model [17] for estimating optimal values of these optimization parameters. In Section 3, we discuss experimental results on a number of machines that reveal the potential for improvements in this model (and in ATLAS). In Section 4, we describe how model refinement and local search can be used to tackle these performance problems. In Section 5, we present experimental results for the same machines as before, showing that this methodology addresses the performance problems identified earlier.
In fact, in most cases, we actually obtain better code than is produced by the ATLAS code generator. We conclude in Section 6 with a discussion of future work. Figure 1 shows a block diagram of the ATLAS system
ATLAS
1 . There are two main modules: Code Generator, and Search Engine. For the purpose of this paper, the Code Generator, given certain optimization parameters which are described in more detail in Section 2.1, produces an optimized matrix-multiplication routine, shown as mini-MMM in Figure 1 . This routine is C code which must be compiled using the native C compiler to produce an executable.
The Search Engine determines optimal values for these optimization parameters by performing a global search over the space of values for these parameters. Values corresponding to each point of the search space are passed to the code generator and the resulting mini-MMM code is run on the actual machine and its performance is recorded. Once the search is complete, the parameter values that give the best performance are used to generate the library. The search process is described in more detail in Section 2.2.
To finish the search in reasonable time, it is necessary to bound the search space. When ATLAS is installed on a machine, it executes a set of micro-benchmarks to measure certain hardware parameters such as the L1 data cache capacity [13] , the number of registers, etc. The search engine uses these hardware parameters to bound the search space for the optimization parameters.
Optimization parameters
Although ATLAS is not a general-purpose compiler, it is useful to view the output of its code generator as if it were the result of applying restructuring compiler transformations to the naïve matrix multiplication code shown in Figure 2 . A detailed description is given in [17] . Here we provide a summary of the relevant points.
To improve locality, ATLAS decomposes an MMM into a sequence of mini-MMMs, where each mini-MMM multiplies 1 The complete ATLAS system includes hand-written code for various routines, which may be used to produce the library on some machines if it is found to outperform the code produced by the ATLAS code generator. Here, we only consider the code produced by the ATLAS code generator.
Figure 2: Naïve MMM Code sub-matrices of size NB × NB. NB is an optimization parameter whose value must be chosen so that the working set of the mini-MMM fits in the L1 cache.
In the terminology of restructuring compilers, the i, j, k loop nest in Figure 2 is tiled with tiles of size NB ×NB ×NB, resulting in two new loop nests -outer and inner. ATLAS uses either j, i, k or i, j, k for the loop order of the outer loop nest, depending on the shape of the matrix arguments. The inner loop nest, which constitutes the mini-MMM, is always in the j , i , k loop order. Each mini-MMM is further decomposed into a sequence of micro-MMMs, as shown in Figure 3 , where each micro-MMM multiplies an MU × 1 sub-matrix of A with a 1 × NU sub-matrix of B and accumulates the result into an MU ×NU sub-matrix of C. MU and NU are optimization parameters that must be chosen so that a micro-MMM can be executed out of the floating-point registers.
In the terminology of restructuring compilers, the mini-MMM loop nest is tiled with tiles of size NU × MU × KU , resulting in a new k , j , i loop nest (which is always executed using this loop order).
The resulting code after these two tiling steps is shown in Figure 4 . To keep this code simple, we have assumed that all loop step sizes evenly divide the corresponding loop bounds exactly. In reality, code should also be generated to handle the fractional tiles at the boundaries of arrays; we omit this clean-up code to avoid complicating the description. To perform register allocation, the micro-MMM loop nest j , k is fully unrolled and the referenced array variables are scalarized. Once the code for loading elements of C is lifted outside the k loop, the body of this loop contains MU + NU loads and MU × NU multiply-add pairs. ATLAS schedules this basic block based on the F M A, Ls, IF , and NF optimization parameters as follows.
1. Intuitively, F M A is 0 if code should be generated without assuming that the hardware supports a fused multiply-add. In that case, dependent multiply and add operations are separated by Ls other multiplies adds. This produces sequence with 2 × MU × NU computation statements (MU × NU if F M A = 1).
2. The MU +NU loads of elements of A and B are injected into the resulting computation sequence by scheduling a block of IF loads in the beginning and blocks of NF loads thereafter as needed.
3. The KU iterations of the k loop are completely unrolled.
4. The k loop is software-pipelined so that operations from the current iteration are overlapped with operations from the previous iteration. Finally, ATLAS copies portions of A, B, and C into sequential memory locations before performing the mini-MMM, if it thinks this would be profitable. The strategy for copying is shown in Figure 4 . ATLAS also incorporates a simple form of tiling for the L2 cache, called CacheEdge; we will not discuss this because our focus in the mini-MMM code, which is independent of CacheEdge.
Global search in ATLAS
It is intuitively obvious that the performance of the generated mini-MMM code suffers if the values of the optimization parameters in Table 1 are too small or too large. For example, if MU and NU are too small, the MU ×NU block of computation instructions might not be large enough to hide the latency of the MU + NU loads, and performance suffers. On the other hand, if these parameters are too large, register spills will reduce performance. Similarly, if the value of KU is too small, there is more loop overhead, but if this value is too big, the code in the body of the k loop will overflow the instruction cache and performance will suffer. The goal therefore is to determine optimal values of these parameters for obtaining the best mini-MMM code.
To find optimal values for the optimization parameters, ATLAS uses a global search strategy called orthogonal line search [11] . This is a general optimization strategy that tries to find the optimal value of a function y = f (x1, x2, ..., xn) by reducing the n-dimensional optimization problem into a sequence of n 1-dimensional optimization problems by ordering the parameters xi in some order, and optimizing them one at a time in that order, using reference values for parameters that have not been optimized yet. Orthogonal line search is an approximate method in the sense that it does not necessarily find the optimal value of a function, but it might come close if the parameters x1, x2, ..., xn are more or less independent. The specific order used in ATLAS is: NB; (MU NU ); KU ; Ls FF , IF , and NF . Details can be found in [18] .
Model-driven optimization
We now describe a model for estimating values for optimization parameters [17] . This model is used to generate mini-MMM code using the ATLAS code generator, as shown in Figure 5 . This model requires the following machine parameters.
• CL1, BL1 -capacity and line size of the L1 data cache
• CI -capacity of the instruction cache • NR -number of floating-point registers • Ls -as measured by the ATLAS micro-benchmark • F M A -existence of a fused multiply-add instruction Figure 6 describes the model. The rationale for the model is as follows.
The simplest model for NB is to choose its value so that all three blocks of matrices A, B, and C can reside in the cache simultaneously. This gives the following inequality.
A more careful analysis shows that when multiplying two matrices, capacity misses can be avoided completely if one of the matrices, a row or a column of another matrix, and an element of the third matrix can be cache resident simultaneously [17] . This analysis assumes that the cache replacement policy is optimal. It yields the following inequality for NB.
Finally, we must correct for non-unit cache line size (BL1) and LRU replacement policy, which yields the model shown in Figure 6 [17].
• Choose largest NB, which satisfies:
• Choose MU and NU as follows:
.
• Use Ls as determined by the ATLAS microbenchmark.
• Use FMA as determined by the ATLAS microbenchmark.
• Set FF = 1 and IF = NF = 2. The micro-MMM produced by the ATLAS code generator requires MU × NU registers for storing the elements of C, MU registers to store elements of A, NU registers for storing the elements of B, and Ls registers to store the temporary results of multiplies, which will be consumed by dependent additions. Therefore it is necessary that
To obtain the model shown in Figure 6 , we solve Inequality(4) assuming NU = MU and set NU to its largest integer solution. We then compute MU as the largest integer solution of Inequality (4), based on the computed value for NU . Finally, we found that performance was relatively insensitive to the values of FF , IF and NF , perhaps because the back-end compilers themselves rescheduled operations, so we fixed their values as shown in Figure 6 .
Discussion
It should be noted that similar optimization parameters are needed even if one uses so-called cache-oblivious algorithms [9] . For example, when implementing a cache-oblivious matrix multiplication code, one needs to switch from recursive to iterative code when the problem size becomes small enough to fit in the L1 cache. This requires a parameter similar to NB [3, 5] .
ANALYZING THE GAP
We compared the performance of code generated by AT-LAS and by the model-driven version of ATLAS on ten different architectures. In this section, we discuss only the machines where there are interesting performance differences between the code generated by the original ATLAS system and by the model-driven version we built.
AMD Opteron 240
On this platform, as well as to some extent on all other x86 CISC platforms, we observed a significant performance gap between the code generated using the model and the code generated by ATLAS (compare "Model" and "Global Search" in Figure 7 (d)). To understand the problem, we studied the optimization parameter values produced by the two approaches. These values are shown in Table 12 (a) (to permit easy comparisons between the different approaches explored in this paper, the parameter values used by all approaches are shown together in that section). Although the two sets of values are quite different, this by itself is not necessarily significant. For example, Figure 7 (b) shows how performance of the mini-MMM code changes as a function of NB, while keeping all other parameters fixed. It can be seen that the values chosen by Global Search (NB = 60) and Model (NB = 88) are both good choices for that optimization parameter, even though they are quite different.
On the other hand, performance sensitivity to MU and NU , shown in Figure 7 (c), demonstrates that the optimal values of (MU , NU ) are (6, 1) . Notice that this graph is not symmetric with respect to MU and NU , because an MU × 1 tile of C is contiguous in memory, but a 1×NU tile is not [15] . Global Search finds (MU , NU ) = (6, 1), whereas the model estimates (2, 1). To clinch the matter, we verified that the performance difference disappears if (MU , NU ) are set to (6, 1) , and all other optimization parameters are set to the values estimated by the model. Since the difference in performance between the code produced by global search and the code produced by using the model is about 40%, it is likely that there is a problem with the model presented in Section 2.3 for determining (MU , NU ). Evidence for this is provided by the line "Local Search" in Figure 7 (d), which shows that there is significant performance gap even if we perform local search, described in more detail in Section 4.2, around the parameter values estimated by the model. In Section 4.1.1, we show how model refinement fixes this problem. 
SUN UltraSPARC IIIi
The optimization parameters derived by using the model and global search are shown in Table 13 (a). Figure 13 (c) presents the MMM performance results. On this machine, Model actually performs about 10% better than Global Search.
However, this platform is one of several that expose a deficiency of the model that afflicts ATLAS Global Search as well. The problem lies in the choice of NB. At first sight, it is surprising that the optimal value of NB is so much larger than the model-predicted value. However, when we studied the sensitivity results closely, we realized that these results could be explained by the fact that on this machine, it is actually more beneficial to tile for the L2 cache rather than the L1 cache. Notice that performance increases with increasing values of NB up to NB ≈ 210. Performance degrades for NB > 210, because three tiles do not fit together in the L2 cache anymore (3 × N 2 B > CL2), and L2 conflict misses start to occur. At NB ≈ 360, there is a second, more pronounced, performance drop, which can be explained by applying Inequality (1) for the L2 cache. After this point, L2 capacity misses start to occur. Notice that there are no drops in performance around NB = 52 and NB = 88 which are the corresponding values for the L1 data cache.
In general, it is desirable to tile for the L2 cache if (1) the cache miss latency for the L1 data cache is close to that of the L2 cache, or (2) the cache miss latency of the L1 data cache is small enough that it is possible to entirely hide almost all of the L1 data cache misses with floating point computations. Of course, another possibility is to do multilevel cache tiling but the ATLAS code generator permits tiling for only a single level of the cache hierarchy. In Section 4.1.2, we show how model refinement and local search solve this problem.
We also noticed that even if we tile for the best cache level, the model sometimes computes an NB value which is slightly larger than the optimum. Our study revealed that this effect arose because we ignored the interaction of register tiling and cache tiling. In particular, the extra level of tiling for the register file changes the order of the scalar operations inside the mini-MMM. Therefore, when computing the optimal NB, one should consider horizontal panels of width MU and vertical panels of width NU instead of rows and columns from matrix A and matrix B respectively. In Section 4.1.2, we show how model refinement can take this interaction into account.
Intel Itanium 2
Figure 9(b) shows the sensitivity of performance to NB on Intel Itanium 2. The values chosen by Model, Global Search, and the best value (30, 80 and 360 respectively) are denoted by vertical lines. As with the SUN UltraSPARC IIIi, tiling for the L1 data cache is not beneficial (in fact,we found out that the Itanium does not cache floating-point values in the L1 cache). On this platform, even the L2 cache is "invisible", and the drops in performance are explained by computing the blocking parameters with respect to the L3 cache whose capacity is CL3 = 3M B. Figure 9 (c) zooms into the interval NB ∈ [300, 400], which contains the value that achieves best performance (NB = 360). As we can see, there are performance spikes and dips of as much as 300 MFlops. In particular, the value of NB = 362 computed by 3 × NB ≤ CL3 is not nearly as good as NB = 360. Values of NB that are divisible by MU and NU usually provide slightly better performance because there are no "edge effects"; that is, no special clean-up code needs to be executed for small left-over register tiles at the cache tile boundary. The number of conflict misses in the L1 and L2 caches can also vary with tile size. It is difficult to model conflict misses analytically. Therefore, to address these problems, we advocate using local search around the model-predicted value, as discussed in detail in Section 4.2.1.
For completeness, we show the sensitivity of performance to MU and NU in Figure 9 (d), although there is nothing interesting in this graph. Because of the extremely large number of registers on this platform (NR = 128), the peak of the hill is more like a plateau with a multitude of good choices for the MU and NU unroll factors. Both Model and Global Search do well.
Summary
Our experiments point to a number of deficiencies with the 
CLOSING THE GAP
We now show how the performance gaps discussed in Section 3 can be eliminated by a combination of model refinement and local search. We discuss model refinement in Section 4.1, and local search in Section 4.2.
Model refinement

Refining model for MU and NU
Recall that Inequality (4) for estimating values for MU and NU , reproduced below for convenience, is based on an allocation of registers for a MU × NU sub-matrix (c) of C, a MU × 1 vector (ā) of A, a 1 × NU vector (b) of B, and Ls temporary values.
MU × NU + MU + NU + Ls ≤ NR
Allocating the complete vectorsā andb in registers is justified by the reuse of their elements in computing the outer productc =c +ā ×b. During this procedure each element ofā is accessed NU times and each element ofb is accessed MU times.
However, these arguments implicitly make the following assumptions.
1. MU > 1 and NU > 1: if MU = 1 each element ofb is accessed only one time and therefore there is not enough reuse to justify allocatingb to registers. By analogy the same holds for NU andā respectively.
2. The ISA is three-address: it is assumed that an element ofā and an element ofb can be multiplied and the result stored in a third location, without overwriting any of the input operands, as they need to be reused in other multiplications. This is impossible on a twoaddress code ISA.
Both these assumptions are violated on the AMD Opteron. The Opteron ISA is based on the x86 ISA and is therefore two-address. Moreover, the Opteron has only 8 registers and no fused multiply-add instruction, so the model in Figure 6 estimates that the optimal value of NU is 1. Therefore, there is no reuse of elements ofā. Not surprisingly, the original model does not perform well on this architecture.
In fact, Table 12 (a) shows that the ATLAS global search chose (MU , NU ) = (6, 1), which provides the best performance as can be seen in Figure 7 (c), although it violates Inequality (4). Additionally, global search chose F M A = 1 even though there is no fused multiply-add instruction! To understand why code produced with these parameters achieves high performance, we examined the machine code output by the C compiler. A stylized version of this code is shown in Figure 10 . It uses 1 register forb (rb), 6 registers for c (rc1 . . . rc6) and 1 temporary register (rt) to hold elements ofā.
One might expect that this code will not perform well because there are dependencies between back-to-back adjacent instructions and a single temporary register rt is used. The reason why this code performs well in practice is because the Opteron is an out-of-order issue architecture with register renaming. Therefore it is possible to schedule several multiplications in successive cycles without waiting for the first one to complete, and the the single temporary register rt is renamed to a different physical register for each pair of multiply-add instructions. To verify these conclusions, we used a different mode of floating-point operations on the Opteron in which it uses the 16 SSE vector registers to hold scalar values. In this case, the model predicts (MU , NU ) = (3, 3). However, the ISA is still two-address code and experiments show that better performance can be achieved by (MU , NU ) = (14, 1) [15] .
We conclude that when an architecture has a small number of registers or two-address code ISA, but implements out-of-order issue with register renaming, it is better to leave instruction scheduling to the processor and use the available registers to allocate a larger register tile. These insights permit us to refine the original model for such architectures as follows: NU = 1, MU = NR − 2, F M A = 1.
To finish this story, it is interesting to analyze how the AT-LAS search engine settled on these parameter values. Note that on a processor that does not have a fused multiplyadd instruction, F M A = 1 is equivalent to F M A = 0 and Ls = 1. The code produced by the ATLAS Code Generator is shown schematically in Figure 11 . Note that this code uses 6 registers forā (ra1 . . . ra6), 1 register forb (rb), 6 registers forc (rc1 . . . rc6) and 1 temporary register (implicitly by the multiply-add statement). However, the back-end compiler (GCC) reorganizes this code into the code pattern shown in Figure 10 . Notice that the ATLAS Code Generator itself is not aware that the code of Figure 10 is optimal. However, setting F M A = 1 (even though there is no fused-multiply instruction) produces code that triggers the right instruction reorganization heuristics inside GCC, and performs well on the Opteron. This illustrates the well-known point that search does not need to be intelligent to do the right thing! Nevertheless, our refined model explains the observed performance data, makes intuitive sense, and can be easily incorporated into a compiler.
Although there is a large body of existing work on register allocation and instruction scheduling for pipelined machines [7, 10] , we are not aware of prior work that has highlighted this peculiar interaction between compile-time scheduling, register allocation, dynamic register-renaming, and out-of-order execution.
Refining NB
We made a refinement to the model for NB to correct for the interaction between register tiling and cache tiling as follows. When computing the optimal NB, we consider horizontal panels of width MU and vertical panels of width NU instead of rows and columns from matrix A and matrix B respectively. This leads to the following Inequality (5) .
Finally, to avoid fractional register tiles, we trim the value of NB obtained from Inequality (5) so that it is a multiple of MU and NU . We now describe how we tile for the right cache level. The models presented in Section 2.3 do not account for cache miss penalties at different cache levels, so although we estimate tile sizes for different cache levels, we cannot determine which level to tile for.
One approach to addressing this problem in the context of model-driven optimization is to refine the model to include miss penalties. Our experience however is that it is difficult to use micro-benchmarks to measure miss penalties accurately for lower levels of the memory hierarchy on modern machines. Therefore, we decided to estimate tile sizes for all the cache levels according to Inequality (5), and then empirically determine which one gives the best performance.
Notice that in the context of global search, the problem can be addressed by making the search space for NB large enough. However, this would increase the search time substantially since the size of an L3 cache, which would be used to bound the search space, is typically much larger than the size of an L1 cache. This difficulty highlights the advantage of our approach of using model-driven optimization together with a small amount of search -we can tackle multi-level memory hierarchies without increasing installation time significantly.
Refining Ls
Ls is the optimization parameter that represents the skew factor the ATLAS Code Generator uses when scheduling dependent multiplication and addition operations for the CPU pipeline. Although the ATLAS documentation incorrectly states that Ls is the length of the floating point pipeline [16] , the value is determined empirically in a problem dependent way, which allows global search (and the original model) to achieve good performance. In this section we present our analytical model for Ls, which relies only on hardware parameters.
Studying the description of the scheduling in Section 2.1, we see that the schedule effectively executes Ls independent multiplications and Ls − 1 independent additions between a multiplication muli and the corresponding addition addi. The hope is that these 2Ls − 1 independent instructions will hide the latency of the multiplication. If the floating-point units are fully pipelined and the latency of multiplication is L h , we get the following inequality, which can be solved to obtain a value for Ls.
On some machines, there are multiple floating-point units.
If |ALUF P | is the number of floating-point ALUs, Inequality (6) gets refined as follows.
Solving Inequality (7) for Ls, we obtain Inequality (8) .
Of the machines in our study, only the Intel Pentium machines have floating-point units that are not fully pipelined; in particular, multiplications can be issued only once every 2 cycles. Nevertheless, this does not introduce any error in our model because ATLAS does not schedule back-to-back multiply instructions, but intermixes them with additions. Therefore, Inequality (6) holds.
Local search
In this section, we describe how local search can be used to improve the NB, MU , NU , and Ls optimization parameters chosen by the model.
Local Search for NB
If NB M is the value of NB estimated by the model, we can refine this value by local search in the interval [NB M − lcm(MU , NU ), NB M + lcm(MU , NU )]. This ensures that we examine the first values of NB in the neighborhood of NB M that are divisible by both MU and NU .
Local Search for MU , NU , and Ls
Unlike sensitivity graphs for NB, sensitivity graphs for MU and NU tend to be convex in the neighborhood of modelpredicted values. This is probably because register allocation is under compiler control, and there are no conflict misses. Therefore, we use a simple hill-climbing search strategy to improve these parameters.
We start with the model predicted values for MU , NU , and Ls and determine if performance improves by changing each of them by +1 and −1. We continue following the path of increasing performance until we stop at a local maximum. On platforms on which there is a Fused-Multiply-Add instruction (F M A = 1), the optimization parameter Ls has no effect on the generated code and in that case we only consider MU and NU for the hill-climbing local search.
EXPERIMENTAL RESULTS
We evaluated the following six approaches on a large number of modern plarforms, including DEC Alpha 21264, IBM 4. Multi-level Local Search: This approach is the same as Local Search, but it considers tiling for lower levels of the memory hierarchy as described in Section 4.1.2.
5. Global Search: This is the ATLAS search strategy.
6. Unleashed: This is the full ATLAS distribution that includes user-contributed code, installed by accepting all defaults that the ATLAS team have provided. Although this is not directly relevant to our study, we include it anyway for completeness.
Global Search uses the micro-benchmarks in the standard ATLAS distribution to determine the necessary machine parameters. For all other approaches we used X-Ray [19] , which is a tool implemented by us for accurately measuring machine parameters. Figure 12 shows the results for AMD Opteron 240. For native BLAS we used ACML 2.0. The MMM performance achieved by Model + Local Search is only marginally worse than that of Global Search. Additional analysis showed that 
AMD Opteron 240
N B M U , N U , K U Ls FMA F F , I F , N F MFLOPS
Figure 13: SUN UltraSPARC IIIi
this is due to a slightly suboptimal value of NB. Had we extended the interval in which we do local NB search by a small amount, we would have found the optimal value. We conclude that the model refinement and local search described in Section 4.1 are sufficient to address the performance problems with the basic model described in Section 3. Figure 13 shows the results for SUN UltraSPARC IIIi. We used the native BLAS library included in Sun One Studio 9.0. Model performs marginally better than Global Search because the ATLAS micro-benchmarks estimated that the L1 data cache size is 16KB, rather than 64 KB. This restricted the NB interval examined by Global Search, leading to a suboptimal value of NBand lower performance. MultiLevel Local Search performs better than Local Search because it finds that it is better to tile for the L2 cache rather than for the L1. Figure 14 shows the results for Intel Itanium 2. Native BLAS used is MKL 6.1. Model does not perform well because it tiles for the L1 data cache. ATLAS global search selected the maximum value in its search range (NB = 80). Nevertheless this tile size is not optimal either. The Multilevel model determined that tiling for the 3 MB L3 cache is optimal, and chooses a value of NB = 362. This is refined to NB = 360 by local search. This improves performance compared to both Model and Global Search.
SUN UltraSPARC IIIi
Intel Itanium 2
SGI R12000
Finally, we include some interesting results for the SGI R12K. Figure 15 shows the results for this machine. For native BLAS we used SGI SCSL v.1.4.1.3. The most interesting fact on this platform is that Multi-Level Local Search successfully finds that it is worth tiling for the L2 cache. By doing this, it achieves better performance than even the native BLAS. Global Search achieves slightly bet- than Model and Global Search. Although not entirely visible from the plot, on this platform, the native compiler (SGI MIPSPro) does a relatively good job.
CONCLUSIONS AND FUTURE WORK
The compiler community has invested considerable effort in inventing program optimization strategies which can produce high-quality code from high-level programs, and which can scale to large programs and complex architectures. In spite of this, current compilers produce very poor code even for a simple kernel like matrix multiplication. To make progress in this area, we believe it is necessary to perform detailed case studies.
This paper reports the results of one such case study. In previous work, we have shown that model-driven optimization can produce BLAS codes with performance within 10-20% of that of code produced by empirical optimization. In this paper, we have shown that this remaining performance gap can be eliminated by a combination of model refinement and local search, without increasing search time substantially. The model refinement (i) corrected the computation of the register tile parameters (MU , NU ) for machines which have either a 2-address ISA or relatively few logical registers, and (ii) adjusted the value of the cache tile parameter (NB) to take interactions with register tiles into account. The local search allowed us to take L1 conflict misses into account, and to open up the possibility of tiling for lower levels of the memory hierarchy. On some machines, this strategy outperformed ATLAS Global Search and the native BLAS.
We believe that this combination of model refinement and local search is promising, and it is the corner-stone of a system we are building for generating dense numerical linear algebra libraries that are optimized for many levels of the memory hierarchy, a problem for which global search is not tractable. We also believe that this approach is the most promising one for incorporation into general-purpose compilers.
