Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.
INTRODUCTION
The performance of many scientific applications and linear algebra kernels is limited by memory bandwidth [24], a situation that is likely to continue for the foreseeable future [36]. Computer scientists apply tuning techniques to improve data locality and create highly efficient implementations of the Basic Linear Algebra Subprograms (BLAS) [5, 18, 23, 28, 49] and LAPACK [6], enabling scientists to build high-performance software at reduced cost. While tuned libraries for the level 3 BLAS and LAPACK routines perform at or near machine peak, level 1 and 2 BLAS routines, in which there is less data reuse, achieve only a fraction of peak [27]. However, sequences of level 1 and 2 BLAS routines appear in many scientific applications, and these sequences represent further opportunities for tuning. In particular, fusing the loops of successive BLAS routines reduces memory traffic and increases performance [8, 27]. Four such combined routines have been added
to the BLAS standard [13] . Many more combined routines are needed, but adding every possible combination to the BLAS standard is not feasible.
To automate the creation of combined routines for level 1 and level 2 operations, the authors have developed a compiler, named Build to Order BLAS (BTO), whose input is a sequence of statements of matrix and vector arithmetic in annotated MATLAB and whose output is a tuned implementation of that sequence in C++.
(Generating C or Fortran instead would be a simple change because we use very few features specific to C++.) Our initial prototype applied loop fusion at every opportunity, regardless of profitability [47] . In this paper we present a more refined approach in which the compiler enumerates loop fusion decisions but uses a combination of analytic and empirical techniques to identify the most promising choices.
In particular, the contributions of this paper are as follows:
1. We present an algorithm that compiles linear algebra specifications into loops and enumerates the optimization choices arising from two variants of loop fusion (Section 3).
2. We present a hybrid analytic/empirical performance evaluation method that finds the best combination of optimization decisions in less than two minutes (Section 4).
3. We report experimental results showing speedups of up to 130% relative to the GotoBLAS [22] on an AMD Opteron and up to 137% relative to the Intel Math Kernel Library (MKL) [28] on an Intel Core 2 (Section 5).
The rest of the paper starts with some background and related work in Section 2. It then discusses the above contributions in Sections 3, 4, and 5. The paper ends with conclusions and plans for future work in Section 6.
BACKGROUND AND RELATED WORK
We start with a review of loop fusion, which we apply to obtain memory efficiency, then review the literature on domain-specific compilers, auto-tuning libraries, loop restructuring compilers, and methods for analyzing the profitability of optimizations.
Loop Fusion
The transformation of one or more loops into a single loop is called loop fusion. If two loops reference the same matrices or vectors, the temporal locality of those references can be improved by performing loop fusion. When applied to memory-bound linear algebra computations, loop fusion can lead to significant runtime speedups [27]. To be a candidate for loop fusion, the loops must have compatible iterations. For example, they can both iterate over the column dimension or both iterate over the row dimension of a matrix. We illustrate loop fusion in Figure 1, applying it to the GEMVER kernel of the updated BLAS [13]. Assuming that a column of matrix A remains in cache throughout an iteration of the loop, the fused implementation only reads and writes A once from main memory. The benefit comes from not having to make four calls to BLAS routines where each would read in the matrix from memory and two would also write A to memory.
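To make this concrete, here is one way the fused GEMVER computation can be written as a single pass over A (a sketch for illustration only, not the compiler's exact output; it assumes column-major storage, an in-place update of A, and a zero-initialized w):

// Fused GEMVER: for each column j, the two rank-1 updates, the dot product
// contributing to x[j], and the update of w all reuse A(:, j) while it is in cache.
void gemver_fused(int m, int n, double alpha, double beta, double* A,
                  const double* u1, const double* v1,
                  const double* u2, const double* v2,
                  const double* y, const double* z, double* x, double* w) {
  for (int j = 0; j < n; ++j) {
    double xj = 0.0;
    for (int i = 0; i < m; ++i) {
      A[i + j*m] += u1[i]*v1[j] + u2[i]*v2[j];  // A = A + u1*v1' + u2*v2'
      xj += A[i + j*m] * y[i];                  // accumulate A'*y for column j
    }
    x[j] = beta*xj + z[j];                      // x = beta*(A'*y) + z
    for (int i = 0; i < m; ++i)
      w[i] += alpha * A[i + j*m] * x[j];        // w = alpha*(A*x)
  }
}

An unfused implementation instead makes a separate pass over the matrix for each of the four constituent operations.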
While it is often beneficial to fuse loops, it is not always so. Suppose that there is only enough room in cache for two arrays of length m. Then it is not profitable to fuse the first two scaled vector additions on the right hand side of Figure 1 because doing so brings three arrays (u1, u2, and A(:, j)) through the memory hierarchy at the same time, causing some of A(:, j) to be evicted. Thus, an optimizing compiler needs to account for the computer architecture as well as matrix order, storage format, and matrix sparsity.
Domain-specific Compilers
The MaJIC and FALCON compilers for MATLAB optimize matrix expressions according to algebraic laws [34, 43], but they do not perform loop fusion. Several linear-algebra-specific compilers take a loop nest specification written over a dense matrix and produce implementations of sparse matrix operations [10, 38]. Our work differs in that the input is matrix algebra (instead of loops), and we optimize sequences of operations for memory efficiency. The telescoping languages project [29] analyzes MATLAB scripts and optimizes them using procedure specialization, strength reduction, and vectorization. Similarly, Broadway [26] optimizes calls to libraries such as PLAPACK [4]. Our work differs in that we generate the linear algebra routines instead of making calls to pre-existing libraries.
Considerable research has gone into optimizing arrays in languages such as HPF [30], APL [15], and many others. Our compiler differs in that it applies domain-specific knowledge of linear algebra to handle a wide range of matrix storage formats. Domain-specific compilers are growing in popularity, with diverse applications in fields such as DSP transforms [39], tensors [3], and optimizing user-defined abstractions [41].
High-Performance Libraries and Auto-tuning
Active libraries are libraries that optimize themselves [17]. One example is ATLAS, which uses empirically guided iterative compilation to generate linear algebra operations such as matrix multiplication [49, 50]. We instead optimize sequences of matrix operations and focus on level 2 operations. Blitz++ [48], the Matrix Template Library [46], and Bernoulli [2] use expression templates to perform loop fusion inside individual statements but not across statements. The TaskGraph library uses a dataflow representation to apply loop fusion across statements at run-time [45]. Our approach fuses loops statically and separates the compiler into a graph rewriting engine and linear algebra database for extensibility.
Hierarchically Tiled Arrays (HTA) [11] provide a convenient abstraction for writing tiled algorithms. The Formal Linear Algebra Methods Environment (FLAME) [9, 25] partially automates the generation and implementation of correct linear algebra algorithms. Our compiler is fully automatic.
Loop Restructuring Compilers
There is a long tradition of optimizing compilers for general purpose languages that restructure loops to improve data locality. The major approaches in this area are the unimodular [51] , affine [32] , and polyhedral [14] frameworks. All of them rely on dependence analysis [31] to determine when a transformation is legal. This works well for regular arrays but not sparse matrices. The input to our compiler is in terms of matrix operations instead of loops, so dependence analysis is not needed to determine the legality of our transformations.
Analytic Methods for Profitability Analysis
Ferrante et al. [20] estimate the number of cache lines accessed in a perfectly nested loop using a capacity-based model. Our model also relies on capacity but handles arbitrary sequences of loop nests and takes reuse distances into account. Ghosh et al. [21] formulate equations for reuse distances and cache misses that provide a high degree of accuracy. However, their model is expensive to evaluate. Rivera and Tseng [42] develop a conflict model to predict the profitability of array padding. To reduce model evaluation time, our model does not include conflict misses; we rely on empirical testing for that level of detail. Yotov et al. [53] develop an analytic model for matrix multiplication and show that analytic models can compete with empirical testing. Agakov et al. [1] apply machine learning to predict the profitability of register allocation choices and instruction scheduling.
Empirical Methods for Profitability Analysis
PHiPAC [12] and ATLAS [49, 50] pioneered the use of empirical search to evaluate the profitability of optimizations in the setting of auto-tuned libraries. Zhao et al. [55] use exhaustive search and empirical testing to select the best combination of loop fusion decisions. Yi and Qasem [52] apply empirical search to determine the profitability of optimizations for register reuse, SSE vectorization, strength reduction, loop unrolling, and prefetching. Their framework is parameterized with respect to the search algorithm and includes numerous search strategies. Pouchet et al. [37] use decoupling heuristics and a genetic algorithm in the setting of a polyhedral model to search for high-performance loop nestings. In this paper we use analytic models to quickly discard unprofitable choices and then use empirical testing to narrow in on the best.
Hybrid Methods for Profitability Analysis
Recently, a number of researchers have explored hybrid methods that use a combination of analytic modeling and empirical testing. Chen et al. [16] use analytic methods to select a small number of optimization variants, applying combinations of loop permutation, unrolling, register and loop tiling, copy optimization, and prefetching, but not loop fusion. They use empirical testing to select the best variant, typically in three to eight minutes. Epshteyn et al. [19] consider loop tiling decisions in the context of matrix multiplication and use an explanation-based learning algorithm to adapt their analytic model based on empirical results.
Qasem [40] uses pattern-based direct search to find good combinations of loop fusion decisions. Because Qasem is targeting a general purpose compiler, he prefers the scalability of direct search over exhaustive search even though direct search sometimes misses the globally optimal solution. Yotov et al. [54] advocate using analytic methods to make rough, global decisions and then use empirical search locally to fine-tune performance.
In this paper we consider loop fusion, which is particularly challenging because fusion decisions interact with one another. To avoid local minima, we use exhaustive search combined with analytic methods to quickly discard unprofitable combinations, then use empirical testing to select the best. Despite using exhaustive search, we generate high-performance code in less than two minutes (see Section 5). While exhaustive search may not be realistic for general purpose compilers, our results show that it is effective in the more restricted setting of linear algebra kernels.
THE COMPILATION FRAMEWORK
The BTO compiler generates a high-performance implementation from a kernel specification. In particular, the input consists of declarations for the input and output parameters followed by a sequence of statements written in MATLAB syntax. An example kernel specification for GEMVER is shown in Listing 1. The three major phases of the compilation process are input processing, optimization, and performance analysis (see Figure 2 ). We discuss input processing and optimization in this section and discuss performance analysis in Section 4.
The compiler parses the kernel specification into a dataflow graph; the graph for GEMVER is shown in Figure 3 . Initially, each node in the graph represents an input or output parameter or an operation. Edges between nodes represent the flow of data. After parsing, the compiler performs type inference to determine the dimensionality and best traversal pattern for each node.
Type Inference
The compiler assigns a type to each node in the graph, where types are given by the following grammar:

τ ::= S | O<τ>        O ::= C | R
The type S is for scalars, such as double-precision floating point numbers. A type of the form O<τ> describes a container with element type τ and orientation O. The orientation is either C for column or R for row. Orientation plays two roles: it describes the shape of the node, e.g., C<S> is a column vector, and it describes the preferred traversal patterns, which are chosen to correspond to physical memory layout. For example, C<R<S>> describes a matrix whose rows are stored in contiguous memory (as in the language C), whereas R<C<S>> describes a matrix whose columns are stored in contiguous memory (as in Fortran). We define the following transpose operator on orientations: R^T = C and C^T = R.

The type inference phase assigns a type to each node and at the same time chooses how to implement each operation node. The type inference is data driven, informed by a linear algebra knowledge base of which several rows are shown in Table 1. There is one row for each algorithm that implements a given operation. An algorithm is a valid implementation for an operation node in the graph if 1) the algorithm's operator (e.g., + or ×) matches the operation label on the node, 2) the operand types match the types of the operands, and 3) the result type matches the type of the node. If no algorithm can be inferred for an operation node, the compiler reports an error.
The notation we use for result types deserves some explanation. The notation includes the use of the + and × operators within the type. What this means is that the type inference algorithm is applied recursively to obtain the result type. For example, consider the add algorithm whose result type is specified as O<τl + τr>. Suppose the operands of a node labeled with the operation + both have type C<S>. To compute the result type we recursively compute the result type for S + S. The only algorithm that applies is s-add, so the inner result type is S and therefore the outer result type is C<S>.
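The recursion can be pictured with a small sketch (illustrative only; the Type struct and addResultType function are names introduced here, and only the s-add and add rows of Table 1 are modeled):

#include <memory>
#include <stdexcept>

// Types from the grammar: S, or an oriented container O<tau> with O in {C, R}.
struct Type {
  char kind;                   // 'S' scalar, 'C' column container, 'R' row container
  std::shared_ptr<Type> elem;  // element type; null for scalars
};

// Result type of l + r: scalars add to a scalar (s-add); containers add when
// their orientations match, with the element type computed recursively (add).
Type addResultType(const Type& l, const Type& r) {
  if (l.kind == 'S' && r.kind == 'S')
    return {'S', nullptr};
  if (l.kind == r.kind && l.elem && r.elem)
    return {l.kind, std::make_shared<Type>(addResultType(*l.elem, *r.elem))};
  throw std::runtime_error("no algorithm matches this + node");
}

For two C<S> operands this returns C<S>, matching the example above.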
Consider the type that would be inferred for the node corresponding to u1 * v1' in the GEMVER kernel of Listing 1. The only applicable algorithm for the enclosing + node is add, which requires the orientations of its two operands to match; because matrix A has type R<C<S>>, the node for u1 * v1' must also have type R<C<S>>. The compiler therefore chooses the outer2 algorithm for u1 * v1' because that version of outer product has a result type that matches R<C<S>>.
As another example, consider the node corresponding to A' * y in Listing 1. The matrix A' has type C<R<S>> and the vector y has type C<S>. Thus, the algorithm cc-mult is a match for this multiplication node and the result type is C<S>.
Multiple algorithms may be valid choices for the same node. For example, the algorithms outer1 and outer2 may sometimes be valid choices for the same node. In such cases, our compiler makes an arbitrary choice. As future work, we plan to evaluate the performance of each option and choose the best.
Refinement and Optimization
The primary optimization used in the compiler is loop fusion, so as the compiler generates loops, it also chooses which loops to fuse. However, because of complex global interactions, the profitability of a particular fusion decision cannot be assessed in isolation; the decision must be made in the context of all the other decisions. For example, consider a sequence of three loops that are eligible for fusion. The decision to fuse the first two loops cannot be made without considering the third loop because fusing the second and third loops might result in better locality than fusing the first and second. Fusing all three loops, on the other hand, might result in lower performance by increasing register pressure or causing reused data to fall out of cache. To find the optimal set of choices, the BTO compiler quickly explores many combinations of fusion decisions.

The BTO compiler carries out implementation and optimization decisions by applying graph transformations to the dataflow graph. A graph transformation consists of a pattern that specifies to what kind of subgraphs the transformation applies and a rewrite rule that specifies what nodes and edges should be added or removed from the graph. The graph transformations we use come in two varieties, refinements and optimizations. A refinement makes an implementation choice by expanding higher-level operations into lower-level operations, such as expanding a vector operation into a loop over scalar operations. The graph transformations for these refinements are stored in the linear algebra knowledge base. The optimizing transformations, on the other hand, replace subgraphs with functionally-equivalent subgraphs that may provide better performance. The optimizing transformations are stored in a separate knowledge base.
The REFINE and OPTIMIZE algorithms are shown in Figure 4 . The REFINE algorithm carries out the implementation choices that were made during type inference. It iterates through the graph, applying the graph transformation associated with the chosen algorithm. When graph transformations add nodes to the graph, they give new nodes larger indices than existing nodes. Thus, as nodes are added to the graph, they too are eligible for further refinement. We discuss particular refinements in Section 3.2.1.
The OPTIMIZE algorithm takes as input the graph produced by the REFINE algorithm and then explores optimization choices. We discuss particular optimizations in Section 3.2.2. The OPTIMIZE algorithm explores choices in a depth-first manner by maintaining a stack of tuples that represent work items. Each tuple contains a version of the dataflow graph and the current node.
If the current node is in the graph, then we push that graph back onto the stack with the node incremented by one, representing the decision not to optimize this node. The algorithm then searches for an applicable optimization and applies it to a copy of the current graph. The transformed graph is also pushed onto the stack, with the node incremented by one. When decisions have been made for all the nodes in a graph, the algorithm calls ADD-TO-SEARCH-SPACE, shown in Figure 7, which decides whether the graph should be added to the best_versions list. ADD-TO-SEARCH-SPACE is described in Section 3.2.3.
The OPTIMIZE algorithm applies one optimization to each node, in the order in which the nodes are created. This ordering explores most combinations but is not exhaustive in some situations. We are currently working on changes to make the algorithm exhaustive.
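A schematic rendering of this depth-first enumeration follows; Graph, findOptimization, applyOptimization, and addToSearchSpace are stand-ins for the compiler's internal interfaces, and the fixed node count is a simplification of the check that the current node is still in the graph:

#include <stack>
#include <utility>

// Enumerate, depth-first, the per-node optimization decisions described above.
// Each work item pairs a candidate graph with the index of the next node to decide.
template <class Graph, class Find, class Apply, class Keep>
void enumerateVersions(Graph start, int numNodes,
                       Find findOptimization, Apply applyOptimization,
                       Keep addToSearchSpace) {
  std::stack<std::pair<Graph, int>> work;
  work.push({std::move(start), 0});
  while (!work.empty()) {
    auto [g, node] = work.top();
    work.pop();
    if (node >= numNodes) {            // decisions made for every node
      addToSearchSpace(g);             // keep it only if its predicted cost qualifies
      continue;
    }
    work.push({g, node + 1});          // choice 1: leave this node unoptimized
    if (auto opt = findOptimization(g, node)) {
      Graph h = g;                     // choice 2: apply the optimization to a copy
      applyOptimization(h, node, *opt);
      work.push({std::move(h), node + 1});
    }
  }
}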
Graph Refinement
Refinement steps are responsible for reducing high-level matrix and vector operations into loops and scalar operations. Refinement steps introduce new nodes to represent loads, stores, and reduction operations, and they introduce a special kind of subgraph to represent a generalized form of loop over independent operations. Depending on the architecture and the nesting of the loops, a generalized loop can be translated into a parallel loop, a sequential loop, or even a vector instruction.
Generalized loops are created based on the information in Table 1 in the column titled Result Type. For example, in the add row, the result type is O<τl + τr>. A loop is generated for each container type (one in this case). In the implementation, we associate extra information with each container type, such as its size and storage format, which is needed to generate the loop. The body of the loop is informed by the container's element type, in this case τl + τr, which means that the body of the loop adds the elements of the left and right operands. To access elements, we insert nodes to handle load and store operations. Figure 5 shows an example of applying two refinements to a matrix addition according to the add algorithm. Each refinement adds two load nodes, one store node, and a subgraph (surrounded in dotted lines) to represent a loop.
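Concretely, the loops produced by the two refinements in Figure 5 correspond to code of roughly the following shape (illustrative; the function name and the column-major layout implied by the R<C<S>> type are assumptions of the sketch):

// C = A + B after both refinements: one loop per container type, with the
// inserted load and store nodes appearing as explicit element accesses.
void madd_refined(int m, int n, const double* A, const double* B, double* C) {
  for (int j = 0; j < n; ++j) {       // loop from the outer container
    for (int i = 0; i < m; ++i) {     // loop from the inner container
      double a = A[i + j*m];          // load node
      double b = B[i + j*m];          // load node
      C[i + j*m] = a + b;             // s-add followed by a store node
    }
  }
}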
Optimization
The two methods of loop fusion used by the compiler are shown in Figure 6. Figure 6(a) shows the graph transformation for the case when two independent loops access the same data; fusing them lets that data be traversed only once. Figure 6(b) shows the pipelining transformation, which fuses a loop that produces an intermediate result with the loop that consumes it, allowing the intermediate array to be reduced to a scalar temporary. (This is known as array contraction [7] and minimizing array materializations [44].) The pipelining optimization is applicable to operation nodes whose Pipe column entry is "yes" in the linear algebra knowledge base (see Table 1). The BTO compiler does not yet perform loop tiling, and therefore does not achieve high performance for matrix multiplication or other level 3 BLAS operations. Achieving high performance on level 3 operations is not a high priority in our work because the level 3 operations are handled quite well by others [23, 49]; there is little benefit from combining sequences of operations when the operations, like matrix multiplication, have a high ratio of computation to memory access. That said, we plan to add loop tiling because it can achieve modest performance improvements for some level 1 and level 2 operations.
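For instance, when one loop produces a temporary vector that the next loop immediately consumes, pipelining fuses the two and contraction reduces the temporary to a scalar; a sketch of the idea (with illustrative names) for t = x + y followed by s = t' * z:

// Before pipelining: t is materialized in memory and traversed twice.
double unfused(int n, const double* x, const double* y, const double* z, double* t) {
  for (int i = 0; i < n; ++i) t[i] = x[i] + y[i];   // t = x + y
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += t[i] * z[i];     // s = t' * z
  return s;
}

// After pipelining and array contraction: the loops are fused and the vector t
// collapses to a scalar, eliminating n stores and n loads of the temporary.
double fused(int n, const double* x, const double* y, const double* z) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) {
    double ti = x[i] + y[i];
    s += ti * z[i];
  }
  return s;
}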
Search Space Pruning
With our current set of optimizations, the compiler generates between 1 and 648 optimization combinations for each kernel in our suite of benchmarks. Empirically testing 648 versions is possible but, as we add more optimizations, this number will grow quickly, resulting in a large search space. We therefore include an analytic evaluation step that is accurate enough to remove the need for empirically testing many of the versions in the search space. The algorithm for pruning the search space is shown in Figure 7 . We convert the graph to an abstract syntax tree (for convenience) and then pass the tree to the cost estimator described in Section 4. If its cost is lower than the most costly of the best versions so far, we replace the most costly version with the new version. The size to which the set of best versions is allowed to grow is specified by the user.
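A minimal sketch of this bookkeeping, assuming best_versions holds (estimated cost, version) pairs and N is the user-specified limit (the names are ours):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Keep the N versions with the lowest estimated cost seen so far.
template <class Version>
void addToSearchSpace(std::vector<std::pair<double, Version>>& best_versions,
                      std::size_t N, double cost, Version v) {
  if (best_versions.size() < N) {
    best_versions.emplace_back(cost, std::move(v));
    return;
  }
  auto worst = std::max_element(
      best_versions.begin(), best_versions.end(),
      [](const auto& a, const auto& b) { return a.first < b.first; });
  if (cost < worst->first)
    *worst = {cost, std::move(v)};   // replace the most costly of the best versions
}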
Translation to C++
After the analytic model has pruned the search space, we generate code for the best candidates and empirically evaluate their performance. The BTO compiler generates C++ and then compiles the C++ to machine code using the platform's native compiler. The only non-C feature of C++ that we use is references, so it would be a simple matter to retarget the code generator to C or Fortran. The BTO compiler currently does not attempt any lower-level optimizations such as software pipelining or vectorization but relies instead on the native compiler to do so. For computation-bound kernels, or kernels without opportunities for loop fusion, this omission results in sub-optimal performance. In future work, we plan to also include lower-level optimizations in the compiler.
To translate a dataflow graph to C++, the compiler generates one loop for each subgraph and generates the obvious expressions to carry out the scalar operations within each loop. The order in which the loops, and the scalar operations within them, appear in the output is determined by topologically sorting the dataflow graph.
ANALYTIC PREDICTION
To efficiently differentiate between optimization choices, we designed and implemented an analytic model. Section 4.1 describes how we predict the amount of data accessed from each memory structure in a machine. We validate these predictions by presenting measurements from hardware performance counters, hand checking the calculations, and instrumenting the generated code to record memory accesses. To estimate overall performance, the analytic model converts data access predictions into execution time as described in Section 4.2. We validate the predicted performance by comparing to the actual performance.
Predicting Data Accesses
To increase the speed of our model, we restrict it to calculating only the most distinguishing factors. We assume that memory structures are fully associative and have a line size equal to the word size used in the calculation. The model assumes a consecutive access pattern because that is what our compiler produces. The model treats TLBs and caches identically, with the TLB having a size equal to its number of entries times page size. We do not model latency, cache line size, or cache associativity as these factors do not have a significant impact on the relative performance predictions. In general, we assume a "warm cache": if the working set is smaller than cache, we count all memory accesses as cache hits. For larger working sets, we do not count the first miss that brings a datum into cache in situations where that datum will be reused O(n) times or more, where n is the size of a vector or matrix.
Toward defining equations for the number of accesses to each memory structure, we establish several auxiliary notions. Let x range over memory structures in a machine. We write prev(x) for the next smaller memory structure than x. The function prev is undefined (equal to ⊥) for the smallest memory structure (typically the L1 cache). The size of a memory structure x is written size(x).
To represent all memory accesses within a loop L, we use a multiset of addresses R(L). Each address occurs once in R(L) for each time it is accessed in loop L (not including subloops). Let d range over memory addresses. We write R(L)(d) for the number of occurrences of d in R(L).
We write L1 ≤ L2 when L1 is either the same loop as L2 or nested somewhere within L2. For a loop L, the working set of the loop, written WS(L), is the number of unique data accesses in the loop (including sub-loops).
The reuse distance of data element d in loop L, written RD(d, L), is the number of unique data accesses between two accesses to d during the execution of loop L. Now we define the hits to a memory structure x in loop L, written H(x, L), and we define the accesses to x in loop L, written A(x, L). These two multisets are mutually recursive but the base case of A does not rely on H.
The number of accesses to memory structure x in loop L (not counting sub-loops) is |A(x, L)|.
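Written out schematically, one consistent reading of these definitions is the following (a reconstruction for exposition, with the smallest memory structure as the base case):

A(x, L) = R(L)                              if prev(x) = ⊥
A(x, L) = A(prev(x), L) − H(prev(x), L)     otherwise
H(x, L) = the multiset of accesses d in A(x, L) with RD(d, L) < size(x)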
Implementation
The model takes as input an abstract syntax tree and a machine description. The first step in using the model is to convert the compiler's dataflow graph to an abstract syntax tree. The tree contains three types of nodes: loops, statements, and variables. Each loop node contains information about the variables on which it iterates, how many times the loop runs, and pointers to all variables accessed within the loop. Each statement node contains all the variables that it references. A variable node includes the variable's name, the number of different iterates that are used to access the variable (if the variable is an array), and the iterates' names. The top node in an abstract syntax tree is always one that does not iterate on any variables. It performs a single iteration to allow the combination of multiple independent loops for analysis.
A machine is represented in a structure that contains its name, the number of memory structures it contains, and pointers to those memory structures. Each memory structure contains the amount of data it can hold or address, and a bandwidth, all expressed in bytes.
For each memory structure except the smallest, we perform a depth-first traversal of the abstract syntax tree, evaluating Equation 1 until a loop with a working set smaller than the memory structure is found. Then accesses to the next larger memory structure are computed using Equation 3 beginning with the next higher loop in the tree until the root is reached. Accesses are stored per variable with each instance of a variable having a distinct miss count.
In the implementation of the model, we trade accuracy for speed. We only calculate accesses, hits, and reuse distances at the first element of each array, which forfeits some accuracy as reuse distances sometimes change throughout arrays. This simplification only affects our predictions for data sets near cache boundaries.
Evaluation
To ensure the correctness and accuracy of the implementation, we first compare the accesses computed by the implementation of the analytic model to those resulting from a by-hand calculation of Equation 3. The greatest possible difference between the two is twice the maximum number of variables in a statement (an array only counts once) times the word size of a data element. This difference results from the model's not enforcing the ordering of variables within a statement. It is small compared to both cache size and reuse distance. In contrast, the predictions of the points at which matrices and vectors no longer fit into a memory structure are in exact agreement.
Comparing the reuse distances calculated by the model to reuse distances recorded by instrumenting code shows that the model predicts reuse to within the same constant as the constant for memory accesses, described above.
Finally, we compare our model's predictions of memory reads to the actual number of reads measured by hardware performance counters. For each memory structure and loop, we divide the number of accesses by the memory structure's line or page size, written LS(x), to obtain the number of cache lines or page table walks needed for that memory structure, written LA(x):
LA(x) = |A(x, L)| / LS(x)    (4)

Figure 8 shows the predicted and actual memory miss results for the ATAX kernel. (Misses to a memory structure are equivalent to accesses from the next larger structure.) A separate graph is used for each memory structure (see Table 3). The predicted misses for the L1 and L2 caches are accurate to within 1% except near cache boundaries. In those cases, conflict misses play an important role and expose the difference between the set associativity of the actual caches and the full associativity assumed by the model.
The TLB is fully associative, so it does not experience conflict misses. The predicted misses for the TLB are accurate to within 10% for large matrices on both the Intel Core 2 and the AMD Opteron.
Cost Function
To convert memory predictions to a single value for comparing optimization choices we define a cost function. The cost function takes as input the data accesses computed by our model (Section 4.1.1) and the bandwidth, B(x), between each memory structure x and the CPU. We obtain the bandwidths using TRIAD from the Stream benchmark [33] .
If L is an inner loop, the cost is computed as follows:

cost(L) = max_x A(x, L) / B(x)    (5)
We use the largest value of A(x, L)/B(x) because that represents the bottleneck that limits performance. We write child(L) for the set of loops directly nested within loop L. If L is an outer loop, the cost (Equation 6) is computed recursively, combining L's own bottleneck term with the costs of the loops in child(L).
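The recursion can be pictured with the following sketch; the particular way an outer loop's bottleneck term is combined with its children's costs (summation here) is chosen for illustration and is an assumption of the sketch, not a statement of Equation 6:

#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Loop {
  std::map<std::string, double> accesses;  // |A(x, L)| per memory structure x
  std::vector<Loop> children;              // child(L): loops nested directly in L
};

// Estimated time for loop L: its own bandwidth bottleneck (Equation 5) plus,
// for outer loops, the estimated time of its children (summation assumed here).
double cost(const Loop& L, const std::map<std::string, double>& B) {
  double bottleneck = 0.0;
  for (const auto& [x, a] : L.accesses)
    bottleneck = std::max(bottleneck, a / B.at(x));
  double childCost = 0.0;
  for (const Loop& c : L.children)
    childCost += cost(c, B);
  return bottleneck + childCost;
}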
When applied recursively to the top of the abstract syntax tree, Equation 6 produces a runtime cost estimate. In Figures 9 and 10, we compare the runtime estimates from the cost function with actual runtimes of the compiler-generated codes on the Intel Core 2. Each graph in Figure 9 shows the results for the two versions of ATAX produced by the compiler. In each figure, the lines are numbered according to how much fusion was applied to that version of the kernel, with number 1 corresponding to no fusion. Figure 9 shows that our predictions for large matrices and vectors are accurate: predicted and actual performances are nearly identical. For smaller matrices, our predictions are less accurate for two reasons. First, the model implementation does not currently take the L1 cache into account, so when L1 bandwidth is the bottleneck, the predictions are off. Also, the model does not take the cost of arithmetic or other computations into account, so the model over-predicts performance when computation is the bottleneck. At cache boundaries, our predictions change abruptly while the compiler-produced versions follow smooth curves. The abrupt changes happen because the cost model uses the memory model's predictions; therefore, jumps in performance in Figure 9 correspond to the jumps in memory predictions in Figure 8.

Figure 10 compares our predictions to actual performance for the 648 versions of GEMVER produced by the BTO compiler, at matrix order 3000. It shows that as the actual performance increases, the predicted performance, for the most part, increases as well. The model over-predicts performance in all cases, though this inaccuracy is less important than the performance difference between versions. When kernels require temporary storage, the first write to a temporary array produces ten times the expected TLB misses. If we replace the TLB predictions with the actual number of TLB misses, the resulting costs are extremely accurate.
The last observation from Figures 9 and 10 is that our cost function always finds the best kernel. All of the findings on the Intel Core 2 hold on the Opteron as well, except for the GEMVER kernel, where the best version is ranked second by the cost function.
EXPERIMENTAL EVALUATION
In this section, we present several performance studies. We first describe our test environment and then present the performance of the code generated by our compiler when using empirical testing for the entire search space. We then show that pruning the space using the analytic model drastically reduces compile time without significantly decreasing the performance of the generated code.
Test Environment and Methodology
We evaluated the compiler on the eleven kernels listed in Table 2 and on the two computers, the Intel Core 2 and the AMD Opteron, listed in Table 3. The selection of kernels tells a complete story regarding the performance of our compiler. First, we selected several kernels with opportunities for loop fusion to demonstrate situations in which our compiler shines. Of these, the BiCGK kernel appears in the biconjugate gradient method, ATAX in the computation of normal equations, GEMVER, DGEMVT, AXPYDOT, and WAXPBY appear in the updated BLAS [13], and VADD and MADD are kernels that we created. Next, we selected GESUMMV from the updated BLAS to show a kernel with some opportunities for fusion, but where fusion does not make a big impact. Lastly, we selected the DGEMV kernel from the BLAS, in which there is no opportunity for loop fusion.
We compiled the C++ code generated by our compiler using Intel's icc compiler with -O3 and vectorization options. We ran the experiments on matrix orders ranging from 100 to 10,000 at intervals of 100 and on vector dimensions ranging from 10,000 to 1,000,000 at intervals of 10,000. We compared our compiler's performance to several BLAS implementations: GotoBLAS [22] , Netlib [35] , Intel's MKL [28] , and AMD's ACML [5] .
For this comparison, we implement each kernel as a sequence of calls to BLAS. For example, to implement GEMVER, we make two calls to DGER, two calls to DCOPY, and two calls to DGEMV. All tests are run a minimum of five times to provide statistically sound numbers. The values reported in tables and graphs are averages over all trials. We compute standard deviations, but, in most cases, they are less than 2%, so we do not include error bars in the graphs. The only exceptions are for experiments with working sets smaller than cache.
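As a concrete illustration, the BLAS-based GEMVER baseline described above can be written with the CBLAS interface roughly as follows (column-major storage assumed; the exact argument choices are illustrative):

#include <cblas.h>

// GEMVER via BLAS calls: B = A + u1*v1' + u2*v2'; x = beta*B'*y + z; w = alpha*B*x.
void gemver_blas(int m, int n, double alpha, double beta,
                 const double* A, double* B,
                 const double* u1, const double* v1,
                 const double* u2, const double* v2,
                 const double* y, const double* z, double* x, double* w) {
  cblas_dcopy(m * n, A, 1, B, 1);                               // B = A
  cblas_dger(CblasColMajor, m, n, 1.0, u1, 1, v1, 1, B, m);     // B += u1*v1'
  cblas_dger(CblasColMajor, m, n, 1.0, u2, 1, v2, 1, B, m);     // B += u2*v2'
  cblas_dcopy(n, z, 1, x, 1);                                   // x = z
  cblas_dgemv(CblasColMajor, CblasTrans, m, n, beta, B, m,
              y, 1, 1.0, x, 1);                                 // x = beta*B'*y + x
  cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, alpha, B, m,
              x, 1, 0.0, w, 1);                                 // w = alpha*B*x
}

Each of the six calls makes its own pass over the matrix or vectors, which is precisely the memory traffic that fusion removes.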
Performance of Generated Code
The results in this section show that the BTO compiler achieves higher performance than BLAS-based implementations when a kernel is memory bound and that loop fusion across subroutine boundaries reduces memory traffic. In other situations, the BTO compiler usually achieves slightly lower performance because it relies on the native compiler for low-level optimizations and does not perform other optimizations such as loop tiling. The graphs in Figure 11 show the performance, on an Intel Core 2, of four BTO-generated kernels compared to BLAS-based implementations. They show that the BTO versions are 21% to 137% faster for level 2 kernels (ATAX, GEMVER) when the matrix order is 1000 or larger. For level 1 kernels (VADD, WAXPBY), the BTO-generated code is 60% to 88% faster. The performance difference for both the level 1 and 2 kernels is due to increased data reuse in the L2 cache. Table 4 shows L2 cache misses on the Intel Core 2 for BTO-generated code and the BLAS-based implementations. The BTO code incurs between 35% and 62% of the L2 misses of the BLAS-based implementations.
Other kernels where loop fusion significantly reduces memory traffic are BiCGK, DGEMVT, MADD, and AXPYDOT (see Table 4). The performance increases for these kernels are summarized in Tables 5 and 6, which compare the results of BTO to the best performing BLAS library for two representative sizes.
There are kernels where the BTO compiler performs loop fusion but does not achieve speedups over BLAS alternatives. For example, the BLAS-based GESUMMV is faster than the BTO-generated code even though our compiler can fuse two DGEMV calls. We do not see memory (Table 4) or performance (Tables 5 and 6) gains for GESUMMV because fusion reduces memory accesses by only O(n), which is dominated by the overall O(n^2) memory accesses, where n is the matrix order. Further, we rely on the native C++ compiler for low-level optimizations such as vectorization and instruction scheduling, but the native compilers do not optimize as well as the hand tuning done by Goto and the Intel and AMD teams. We also see slightly more L2 misses in the BTO version of GESUMMV than in the MKL version (Table 4), which we conjecture could be due to tiling in the MKL BLAS. The lack of low-level optimizations in the BTO compiler also explains the lower performance for small matrix orders on the ATAX and GEMVER kernels. For sizes less than 1000, when all data fits in cache, memory traffic is not a bottleneck: vectorization and instruction scheduling in the inner loop become the primary factors in performance.
The DGEMV kernel is representative of kernels with no fusion opportunities (beyond the obvious fusion already within DGEMV). In such situations, the reliance of BTO on the native compiler makes an even bigger difference. The BTO-generated DGEMV is 17% slower than MKL's for large matrices on an Intel Core 2 (Table 5) and 39% slower than the GotoBLAS on an AMD Opteron (Table 6). The reasons for the lower performance are the same as discussed above for GESUMMV.
For kernels such as ATAX and GEMVER, one might expect even larger performance improvements than what we observe. For these kernels, loop fusion saves memory accesses but introduces another inefficiency. Both kernels read a column or row of the matrix from memory and then perform a dot product. Then the column or row is read from cache and a DAXPY is performed. However, while the column or row is read from cache, the memory bus is idle. Because moving data through the memory bus is the bottleneck of these kernels, leaving it idle reduces the speedup achieved through fusion. This problem can be solved by a form of vector-scale software pipelining, which we plan to integrate into a future edition of the BTO compiler.
Efficiency of Hybrid Analysis
This section demonstrates how our hybrid analytic/empirical approach reduces the overall compile-time without decreasing the performance of the generated code. In Table 7 , we show results for each algorithm for two different matrix orders or vector dimensions (denoted size in the table) on an Intel Core 2. The first column of the table gives the kernel name followed by the number of differently-optimized versions produced by the compiler.
The column headed Best Mflops shows the best performance from the compiler's empirical search, while Model's Top 1% describes the versions with predicted performance within 1% of the memory model's fastest prediction. Range shows the performance range as compared to the Best column, and Count shows the number of versions in that top 1%. For most of the kernels, the predicted top 1% achieve within 10% of the actual best performance, demonstrating that the analytic model is indeed picking out the best versions. The only exception is AXPYDOT, where one of the versions from the predicted top 1% achieves only 75% of the performance of the best. We are looking into this discrepancy in the analytic model.
When there is little performance change from version to version, as in GESUMMV, the analytic model is not able to reduce the set for empirical testing. The Range column shows that the 12 versions vary by a maximum of 8.3%. All 12 versions perform well, so the next edition of our compiler will perform further pruning in such situations and only empirically test a small percentage of the best-performing versions.

Table 8 shows the time taken to compile these kernels. Model Time is the time to analyze all versions for two different sizes of data (the sizes are the same as in Table 5). The two columns under Empirical Time show the time taken to empirically test the same two sizes of data using the model's top 1% and to empirically test all versions, respectively. The two columns under Total Time show the total compile time, both with pruning (empirically testing only the top 1%) and without pruning (empirically testing all versions). In situations where the model predicts only one version in the top 1%, the compiler does not perform empirical testing. For BiCGK, note that there is only one version in the top 1% for matrix size 1000 but two for 10,000.
The table shows that, by testing all versions predicted to run within 1% of the best predicted version, we find the best actual version. The time it takes to achieve these results is under fifteen seconds for all kernels but GESUMMV. Also, using the model dramatically reduces the time it takes to find the best version while adding an insignificant cost when it fails to differentiate between versions. Tests on the AMD Opteron produced the same findings with one exception: for GEMVER, the model does not find the best version. It does, however, find a version within 15% of the best for a matrix size of 1000 and within 5% of the best for a matrix size of 10,000. Thus, there is room for improvement in our modeling of the AMD Opteron.
CONCLUSIONS AND FUTURE WORK
In this paper, we present a compilation framework that transforms linear algebra specifications into C++ after enumerating combinations of decisions for two variants of loop fusion. We also describe an analytic performance model and show that it can accurately distinguish between significant performance differences in the compiler-produced code. We show that, by using the model, the compiler can focus empirical testing and greatly reduce compilation time without significantly decreasing the performance of the generated code.
Our future plans for the compiler include increasing the kinds of optimizations it can perform, including automatic parallelization, loop tiling, software pipelining, and array padding. As these optimizations are added, we plan to update the memory model as needed to predict their effects. Also, we plan to create a symbolic version of the memory model so that we can quickly identify regions (with respect to matrix order and vector size) that have similar memory behavior. This will enable a reduction in the model's execution time and will make possible the use of different techniques (such as loop tiling) for different matrix orders. Additionally, we will explore whether an adaptive memory model would help the compiler to differentiate more finely between routines. A model improved in this manner would account for more hardware and algorithm factors such as cache associativity and varying reuse distance.
