Abstract-The performance bottleneck for many scientific applications is the cost of memory access inside linear algebra kernels. Tuning such kernels for memory efficiency is a complex task that reduces the productivity of computational scientists. Software libraries such as the Basic Linear Algebra Subprograms (BLAS) ameliorate this problem by providing a standard interface for which computer scientists and hardware vendors have created highly-tuned implementations. Scientific applications often require a sequence of BLAS operations, which presents further opportunities for memory optimization. However, because BLAS are tuned in isolation they do not take advantage of these opportunities. This phenomenon motivated the recent addition to the BLAS of several routines that perform sequences of operations. Unfortunately, the exact sequence of operations needed in a given situation is highly application dependent, so many more routines are needed.
In this paper we present preliminary work on a domain-specific compiler that generates implementations for arbitrary sequences of basic linear algebra operations and tunes them for memory efficiency. We report experimental results for dense kernels and show speedups of 25% to 120% relative to sequences of calls to GotoBLAS and vendor-tuned BLAS on Intel Xeon and IBM PowerPC platforms.
I. INTRODUCTION
Many scientific applications are memory bound (Gropp et al., 1999). Tuning for memory efficiency is a complex task that requires balancing interactions between algorithm, data structure, compiler, and computer architecture. To move this burden from the computational scientist to the computer scientist, standard interfaces were developed for linear algebra kernels, the Basic Linear Algebra Subprograms (BLAS), and highly-tuned implementations have been developed (Bilmes et al., 1997; Goto, 2007; IBM, 2007; Intel, 2007; Whaley and Dongarra, 1998). These implementations deliver peak performance for BLAS level 3 routines (e.g., matrix-matrix multiplication), which have a high ratio of floating-point operations to memory accesses. The main optimization technique they use is blocking to improve the reuse of data in caches, registers, and the TLB (Goto and van de Geijn, 2008). However, for BLAS level 1 and 2 operations, which have a lower ratio of floating-point operations to memory accesses, performance is a fraction of peak due to bandwidth limitations (Howell et al., 2008).
Scientific applications often require sequences of BLAS level 1 and 2 operations, and many researchers have observed that such sequences, when implemented as a single specialized routine, can be optimized to reduce memory traffic (Baker et al., 2006; Howell et al., 2008; Vuduc et al., 2003). This phenomenon motivated the recent addition of kernels such as GEMVER and GEMVT to the BLAS (Blackford et al., 2002) and their use in Householder bidiagonalization in LAPACK (Howell et al., 2008). The GEMVER kernel is a combination of outer products, matrix-vector products, and vector additions: A = A + u1*v1' + u2*v2', x = beta*(A'*y) + z, w = alpha*(A*x).
Using the original BLAS, this sequence of operations would require two calls to the GER routine and two calls to the GEMV routine. For a matrix A larger than cache (the common case for many applications) this sequence of BLAS requires reading A from main memory four times and writing A twice. In contrast, if this kernel is implemented as a single routine, the loops can be fused so that A is only read from and written to main memory once.
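As an illustration, this sequence might be written with the CBLAS interface roughly as follows; the sketch assumes double-precision data, a column-major A with leading dimension lda, unit strides, and an extra vector copy to initialize x with z:

cblas_dger(CblasColMajor, m, n, 1.0, u1, 1, v1, 1, A, lda);    /* A = A + u1*v1'      */
cblas_dger(CblasColMajor, m, n, 1.0, u2, 1, v2, 1, A, lda);    /* A = A + u2*v2'      */
cblas_dcopy(n, z, 1, x, 1);                                    /* x = z               */
cblas_dgemv(CblasColMajor, CblasTrans, m, n, beta, A, lda,
            y, 1, 1.0, x, 1);                                  /* x = beta*(A'*y) + x */
cblas_dgemv(CblasColMajor, CblasNoTrans, m, n, alpha, A, lda,
            x, 1, 0.0, w, 1);                                  /* w = alpha*(A*x)     */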
Great care was taken in the original choice of BLAS to pick operations that would be widely applicable, thereby amortizing the fairly large per-routine implementation cost. The problem with larger kernels such as GEMVER and GEMVT is that they are much more application specific. To provide the best performance across a wide range of applications we need many such kernels.
In this paper we describe preliminary work on a domain-specific compiler whose input is a kernel specification (such as the above definition of GEMVER) together with the desired input and output matrix formats, and whose output is a memory-efficient implementation. The key ingredients of our compiler are 1) an input language that is general enough to express the BLAS and similar kernels but is relatively easy for the compiler to analyze, 2) a refinement-based approach that gradually lowers and optimizes a dataflow graph representation of the kernel, 3) an abstract representation of loops, i.e., iterations, that allows the compiler to handle a variety of matrix storage formats and parallel decompositions, and 4) a decoupling of the compiler into a general graph rewriting engine and a database of linear algebra knowledge. Our compiler extends previous work on automatic tuning (Bilmes et al., 1997; Whaley and Dongarra, 1998) by raising the scope of optimization from just one linear algebra operation, such as matrix-matrix multiplication, to a sequence of operations. Our domain-specific approach is quite different from that of most general-purpose high-performance compilers, where the state of the art applies affine transformations to loops over dense arrays (Kandemir, 2005). While our approach is only applicable to linear algebra operations, linear algebra is an important domain and our approach enables the compiler to handle sparse as well as dense matrix formats (see section III-C).

GEMVER
  in    u1 : vector, u2 : vector, v1 : vector, v2 : vector,
        alpha : scalar, beta : scalar, y : vector, z : vector
  inout A : dense column matrix
  out   x : vector, w : vector
{
  A = A + u1 * v1' + u2 * v2'
  x = beta * (A' * y) + z
  w = alpha * (A * x)
}

Fig. 1. An example specification. This kernel is the GEMVER operation from the updated BLAS (Blackford et al., 2002).
The rest of the paper is organized as follows. Section II introduces the goals for our compiler and reviews tuning techniques for memory efficiency. Section III describes our compilation framework, and section IV compares the performance of our prototype compiler to several BLAS implementations (GotoBLAS, MKL, and ESSL) on three dense kernels over two architectures. Section V discusses related work, and section VI outlines the considerable future research that remains to be done and presents our conclusions.

Figure 1 shows an example input to our compiler. The name of the kernel is given at the top, followed by the parameters with their kinds (matrix, vector, or scalar) and storage formats. We use the same syntax as MATLAB to specify the operations. Our prototype compiler handles dense row- and column-oriented matrices. The ultimate goal is to handle a wide range of matrix formats such as those found in the updated BLAS (Blackford et al., 2002). The body of the specification consists of a sequence of assignments with basic linear algebra operations on the right-hand sides. Our current focus is on level 1 and 2 operations, as there is more to be gained from combining and specializing them than from level 3 operations.
II. GENERATING MEMORY EFFICIENT KERNELS
Our compiler generates an implementation of the kernel in C (a Fortran back-end is also planned) consisting of a sequence of loop nests that read and write elements from the matrices and vectors and perform arithmetic operations. The main challenge is to choose a set of loop nests that minimizes the memory traffic for a given platform.
Before discussing the details of the compiler, we review tuning techniques that reduce memory traffic. There are many techniques that improve performance in other ways, but they are not the focus of the current paper. In general, to reduce traffic between main memory and cache, we need to improve temporal locality and/or spatial locality. A program that references the same piece of data multiple times in close succession has temporal locality. A program that successively references data that are close together has spatial locality. A cache can reduce the traffic to main memory when a program exhibits temporal locality because it stores the often-used pieces of data in higher-speed memory. A cache can also reduce traffic when a program exhibits spatial locality because data is moved from main memory to cache one cache line at a time (e.g., 128 bytes on the Pentium 4), so an access to the same cache line does not incur further traffic.
A. Loop fusion
The primary advantage of combining a sequence of BLAS into a single routine is that loops from separate routines can be merged into a single loop, a technique called loop fusion (Bacon et al., 1994). If two loops reference the same matrices or vectors, the temporal locality of those references can be improved by merging the two loops into a single loop. To be a candidate for loop fusion, the loops must have compatible iterations; for example, they could both iterate over the column dimension or both iterate over the row dimension of the matrix. (There are also some conditions regarding data dependencies.) We illustrate loop fusion by applying it to the GEMVER kernel. The BLAS-based implementation of GEMVER consists of four procedure calls, each of which contains a loop nest. Below we sketch the loop nests, omitting the procedure boundaries.
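The sketch assumes a dense column-major A with leading dimension lda, 0-based C indexing, and w initialized to zero:

/* A = A + u1*v1' (GER) */
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        A[i + j*lda] += u1[i] * v1[j];

/* A = A + u2*v2' (GER) */
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        A[i + j*lda] += u2[i] * v2[j];

/* x = beta*(A'*y) + z (GEMV, transposed) */
for (j = 0; j < n; j++) {
    double t = 0.0;
    for (i = 0; i < m; i++)
        t += A[i + j*lda] * y[i];
    x[j] = beta * t + z[j];
}

/* w = alpha*(A*x) (GEMV) */
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        w[i] += alpha * A[i + j*lda] * x[j];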
These loops can be fused into a single loop over the columns of A.
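Under the same assumptions as the sketch above, the fused loop looks roughly as follows; each column of A is updated, used for the transposed product, and used again for the final product before the next column is touched:

for (j = 0; j < n; j++) {
    for (i = 0; i < m; i++)                 /* both rank-1 updates            */
        A[i + j*lda] += u1[i] * v1[j] + u2[i] * v2[j];
    double t = 0.0;
    for (i = 0; i < m; i++)                 /* x(j) = beta*(A(:,j)'*y) + z(j) */
        t += A[i + j*lda] * y[i];
    x[j] = beta * t + z[j];
    for (i = 0; i < m; i++)                 /* w = w + alpha*A(:,j)*x(j)      */
        w[i] += alpha * A[i + j*lda] * x[j];
}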
Assuming that a column of A can remain in cache throughout an iteration of the loop, the fused implementation reads and writes A from main memory only once. Loop fusion is also applicable at the vector-operation level. For example, the two rank-1 updates in GEMVER, A = A + u1*v1' + u2*v2', can be computed with a single traversal of A instead of two.
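A sketch of the fused update of a single column j, under the same storage assumptions as above:

for (i = 0; i < m; i++)
    A[i + j*lda] += u1[i] * v1[j] + u2[i] * v2[j];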
Fusing the rank-1 updates reduces the memory traffic needed to write and read the result of the first rank-1 update.
While it is often beneficial to fuse loops, it is not always so. Suppose that there is only enough room in cache for two arrays of length m. Then it is not profitable to fuse the rank-1 updates in GEMVER because doing so brings three arrays of length m (u1, u2, and A(:, j)) through the memory hierarchy at the same time, causing some of A(:, j) to be evicted. Thus the compiler needs to take into account the specifics of the architecture as well as matrix order, storage format, and sparsity.
B. Contiguous Data Access
The most efficient way to access a matrix is in the order in which its elements appear in memory. So, for example, a matrix stored in column-major order should be traversed one column at a time and a tridiagonal matrix should be traversed one diagonal at a time (Anderson et al., 1990). The reason is that contiguous access improves spatial locality. For sparse matrices, traversing according to the storage order is even more important because traversing in non-storage order requires searching for each element. Thus, it is crucial for us to take into account which iteration patterns result in contiguous access for each matrix storage format.
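For a dense column-major matrix, with element (i, j) stored at A[i + j*lda], the first loop nest in the following sketch (0-based indexing assumed) touches memory contiguously, while the second strides by lda elements on every access and wastes most of each cache line:

/* contiguous: one column at a time */
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        sum += A[i + j*lda];

/* strided: one row at a time */
for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        sum += A[i + j*lda];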
C. Loop Blocking
Loop blocking (also called tiling) is a technique that increases temporal locality by subdividing a loop so that data is reused before the computation moves on to other data. Blocking is particularly effective at reducing memory traffic in BLAS level-3 kernels, where each element of a matrix is used n times. Blocking can also aid level-2 kernels when the vectors no longer fit in cache (Nishtala et al., 2004), which sometimes occurs in large sparse matrix computations. For example, consider the following matrix-vector multiplication on a column-major matrix A. To simplify the presentation, we assume A is dense.
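A sketch of the unblocked computation y = A*x, assuming a column-major A with leading dimension lda, 0-based indexing, and y initialized to zero:

for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        y[i] += A[i + j*lda] * x[j];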
Each inner loop traverses all of vector y, so if y is larger than cache then some of it will be evicted and read back in on each iteration. To apply blocking, an outer loop is introduced that iterates over blocks of y. The block size b is chosen so that blocks y(s : e) and A(s : e, j) fit together in cache. Thus, each block y(s : e) is only read from and written to main memory once. In the following we assume that b divides m evenly.
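Under the same assumptions, a blocked sketch with block size b is:

for (s = 0; s < m; s += b)            /* iterate over blocks y(s : s+b-1)    */
    for (j = 0; j < n; j++)
        for (i = s; i < s + b; i++)   /* y(s:e) and A(s:e, j) stay in cache  */
            y[i] += A[i + j*lda] * x[j];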
D. Matrix Storage Formats
Matrices often have special structure that can be exploited by an appropriate choice of storage format. For example, for a symmetric matrix there is no need to store the entire matrix; we can store just the lower (or upper) triangle. For a matrix where all non-zero entries are close to the main diagonal, a banded storage format can be used. Not only does this reduce the amount of memory needed for the matrix, but it also speeds up computations on the matrix because no work is performed for the zero entries. Many matrices in practice are sparse, and formats such as Compressed Sparse Column (CSC) save memory by not storing the zero elements. The CSC format stores each column as two parallel arrays: the elements and their row indices. While this greatly reduces the overall memory footprint, the bandwidth needed per floating-point operation is increased because the stored indices must also be retrieved.
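As a small illustration, a 3 x 3 matrix with five nonzeros might be stored in CSC as follows (this sketch adds the customary array of column start offsets alongside the two parallel arrays described above):

/*      [ 1 0 4 ]
   A =  [ 0 2 0 ]
        [ 0 3 5 ]                                                            */
double val[]     = { 1.0, 2.0, 3.0, 4.0, 5.0 }; /* nonzeros, column by column  */
int    row_idx[] = { 0,   1,   2,   0,   2   }; /* row index of each nonzero   */
int    col_ptr[] = { 0, 1, 3, 5 };              /* start of each column in val */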
III. THE COMPILATION FRAMEWORK

Figure 2 gives an overview of the compilation process. The kernel specification is parsed into a dataflow graph. This dataflow graph is then iteratively processed until all of the implementation choices have been made. One example of a choice is whether to apply an optimization at a particular point in the dataflow graph. Our prototype compiler always applies an optimization when it is legal, whether or not it may be profitable. We plan to add backtracking search to explore all the alternatives, and we discuss issues regarding choosing between alternatives, the size of the search space, and pruning in section III-E. The compilation process consists of three phases: analysis, refinement, and optimization, which together are iterated until all of the implementation decisions have been made. The graph is then translated into C code.

A. The Dataflow Graph
An example dataflow graph for the GEMVER kernel (from figure 1) is given in figure 3. Each node represents a parameter of the kernel or an operation. The arrows indicate the flow of data. At first the graph specifies what operations are to be performed but does not contain any implementation details. The symbol × in the depicted graph stands variously for outer product, matrix-vector multiplication, and scalar-vector multiplication, and it does not yet specify, for example, whether the outer products compute the result matrix a row or a column at a time.
B. Analyze the Dataflow Graph
At the beginning of the analysis phase, each input and output node of the dataflow graph is labeled with its kind (scalar, vector, matrix, or unknown) and storage format (dense row-major, dense column-major, or unknown) according to the annotations given by the user. During analysis, the kind and storage format are computed for the intermediate nodes and algorithms are assigned to operation nodes. The choice of algorithm and the determination of kinds and storage formats are interrelated, so they must be computed together. For example, the × node in Figure 3 whose inputs are u1 and v1' can be implemented by iterating over rows first or over columns first. The choice depends on how the result is used downstream in the dataflow graph. In this case, the result is added to the outer product of u2 and v2', so we could still choose either option as long as we make the same choice for both outer products. Going one more step downstream, there is an addition with A, which was annotated in Figure 1 to be column major. At this point it is clear that the outer products should be computed in column-major order.
We do not hard-code this knowledge of basic linear algebra in the compiler algorithm itself, but instead use a data-driven approach and store information regarding how to implement basic linear algebra in a separate database, which we refer to as the linear algebra database. This separation allows us to add new matrix formats, operations, and basic linear algebra algorithms without changes to the compiler algorithm. Figure 4 shows a portion of the database. Consider the database record for the matrix-vector multiply based on inner products, i.e., the algorithm labeled mv-dot. The algorithm implements the multiplication (×) operation for a row-major matrix (Arg 1 = R,M) and column vector (Arg 2 = C,V) and produces a column vector (Result = C,V). The mv-dot algorithm can be decomposed into independent operations using the equation given in the Specification column of the database, so the result of mv-dot may be streamed into a downstream computation using loop fusion or pipeline parallelism.
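As a purely illustrative rendering (not the compiler's actual storage format), one such record might be held in memory as:

typedef struct {
    const char *name;    /* algorithm name, e.g. "mv-dot"                        */
    const char *result;  /* orientation and kind of the result, e.g. "C,V"       */
    const char *arg1;    /* orientation and kind of the first operand            */
    const char *arg2;    /* orientation and kind of the second operand           */
    const char *spec;    /* Specification column; the string here is a guess     */
    int decomposable;    /* nonzero if independent sub-computations exist        */
} AlgorithmRecord;

static const AlgorithmRecord mv_dot =
    { "mv-dot", "C,V", "R,M", "C,V", "y(i) = A(i,:) * x", 1 };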
The analysis algorithm makes implementation choices using the most-constrained first strategy (also called minimum remaining values) from the artificial intelligence literature (Russell and Norvig, 2003) . The compiler chooses the node with the fewest matching implementations (in the linear algebra database) and assigns a matching implementation to the node. If there is more than one match, the prototype compiler picks the first. Once we add backtracking we will push each alternative onto the stack to be explored later. Once the choice is made, the kind and storage format labels for the operation node itself and for the input nodes are updated with the labels specified in the linear algebra database for the chosen algorithm. The algorithm then goes on to find the next node with the fewest matching implementations and repeats. In this phase, only the name of the algorithm is assigned to the operation node. The details of the implementation are not specified until the refinement phase.
C. Refine the Dataflow Graph
The refinement phase resolves the implementation for each operation node in the graph into a subgraph defining the details of the chosen algorithm. Each subgraph is an abstract representation of the loop that implements the given operation. In each case, the subgraph has an iteration strategy that specifies how to traverse the elements of the matrix or vector. Figure 5 shows two steps of refinement for a matrix-vector multiplication. The first step expands the matrix-vector multiplication according to the mv-dot algorithm from figure 4. The refinement step replaces the × node with a subgraph that computes the inner product of a row of the matrix (the node labeled (i, :)) with the vector, storing the result in the ith element of the result vector (the node labeled (i)). The subgraph is labeled with i = 1 : m, indicating that the iteration strategy is to traverse the rows of the dense matrix. The second pass of refinement introduces another subgraph to implement the inner products (the dot algorithm from figure 4). The new subgraph iterates through each dense row, as indicated by the j = 1 : n annotation. The sign Σ indicates that the sum of all of the iterations is computed.
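The fully refined subgraphs correspond to scalar code of roughly the following form; the sketch assumes a dense row-major m x n matrix A stored contiguously and 0-based indexing:

for (i = 0; i < m; i++) {          /* outer subgraph: i = 1:m, rows of A    */
    double s = 0.0;
    for (j = 0; j < n; j++)        /* inner subgraph: j = 1:n with Σ        */
        s += A[i*n + j] * x[j];
    y[i] = s;                      /* store into element (i) of the result  */
}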
It is straightforward to see how the refinement process can be augmented to handle other matrix formats, such as sparse or banded: the iteration strategies used by the compiler can be extended to include sparse or banded iterations. Suppose the matrix in figure 5 were a sparse matrix in Compressed Sparse Row format. The inner subgraph would use a sparse iteration strategy instead of a dense iteration strategy, so the j index would be retrieved from the matrix storage instead of incrementing from 1 to n.
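For instance, with the usual three-array CSR layout (values, column indices, and row start offsets; the array names here are illustrative), the sketch above becomes:

for (i = 0; i < m; i++) {
    double s = 0.0;
    for (k = row_ptr[i]; k < row_ptr[i+1]; k++)   /* only the stored entries of row i  */
        s += val[k] * x[col_idx[k]];              /* j = col_idx[k] comes from storage */
    y[i] = s;
}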
D. Optimize the Dataflow Graph
In the optimization step, the dataflow graph is optimized by applying conditional graph rewrite rules. One example rewrite rule is Merge Operand Sharing Subgraphs: if two subgraphs share a common operand but do not depend on one another, and if the iteration strategies of the two subgraphs are compatible, then merge the subgraphs into a single subgraph. This rule is responsible for fusing the loops of the two matrix-vector products in the GEMVER kernel. Another example rule is Merge Dependent Subgraphs: if one subgraph depends on another subgraph, and the iteration strategies of the two subgraphs are compatible, then merge them into a single subgraph. This rule, for example, is responsible for fusing the loops of the two outer products in the GEMVER kernel.
Fig. 4. A database characterizing some algorithms for basic linear algebra operations. Each argument is marked as being row (R) or column (C) oriented and as either a scalar (S), vector (V), or matrix (M). The database also records whether an algorithm can be decomposed into independent sub-computations.

The choice of which optimizations to apply to various parts of the graph depends on the characteristics of the kernel, the algorithm, the matrix order and kind, and architectural features.
The following is an instructive example: y ← A'*A*(x + z + w). Assume that the compiler has already merged the two vector additions into one subgraph, call it S1, and the two matrix products into another subgraph, call it S2. The question then is whether it is profitable to apply the Merge Dependent Subgraphs rule one more time to merge S1 and S2. If the three vectors plus a row of A do not fit in cache, then the merger will not be profitable because it will cause each row of A to fall out of cache before the row gets used a second time. On the other hand, if all three vectors plus a row of A do fit in cache, then S1 and S2 should be merged. In general, we need a way to search through the many different combinations of optimization choices and choose a combination that minimizes memory traffic.
E. Search Space Exploration and Evaluation
The set of all possible combinations of implementation and optimization choices is called the search space (Kisuki et al., 2000) . Although the dataflow graphs are quite small in our setting (usually less than 100 nodes), it is still important to prune the search space to avoid an exponential compilation time. Our prototype already performs some pruning: the optimization rewrite rules only apply under certain circumstances. However, as mentioned above we are sometimes overly aggressive in applying optimizations. To solve this problem we plan to use a combination of analytic methods and empirical methods (Chen et al., 2007) . We plan to use a fast analytic cost function to prune out regions of the search space that will not produce competitive implementations. The cost function is based on a model of the computer architecture, but does not necessarily account for all the details. Once the search space is narrowed, we will use empirical testing with representative data sets to determine the exact performance of the alternatives and choose the best one.
F. Translate the Dataflow Graph to C
A graph that cannot be further refined has been reduced to a collection of subgraphs representing loops over scalar operations. The generator outputs a C loop for each subgraph. The order in which the loops, and scalar operations within the loops, appear in the output is determined by performing a topological sort.
IV. PERFORMANCE EXPERIMENTS
In figures 7 and 8 we compare the performance of kernels generated by our prototype compiler to the performance of sequences of calls to BLAS. Figure 7 reports experiments on a 2.4 GHz Intel Xeon processor with 4GB RAM and a 4MB L2 cache. Figure 8 reports experiments on a 1.6 GHz IBM PowerPC 970 with 2.5GB RAM and a 512KB L2 cache. The figures show the measured Mflops rating versus matrix order for the three kernels defined in Figure 6. Because the results for large and small matrices are so different, two different plots are shown for each kernel. The graphs in the top row of figures 7 and 8 show the results for small matrices (order up to 1,000) and the graphs in the bottom row show the results for large matrices (order from 1,000 to 10,000). Our primary goal was to improve performance on large matrices: for these kernels we see speedups between 15% and 120%.
The results from our prototype compiler are marked BTO, for Build to Order. (We explain the curve marked BTO-vec below.) The Netlib BLAS (Netlib, 2007) are for the most part unoptimized and therefore serve as a baseline. Our generated code differs from the Netlib BLAS only in that the compiler applies loop fusion; neither performs inner-loop optimizations such as loop unrolling, software pipelining, or vectorization. We also provide the performance numbers for the GotoBLAS (Goto and van de Geijn, 2008), the vendor-supplied Math Kernel Library (Intel, 2007), and the Engineering and Scientific Subroutine Library (IBM, 2007). These high-performance implementations include inner-loop optimizations and therefore serve as a good way to compare the benefits of those optimizations to loop fusion.
For the Intel experiments (figure 7) we used the icpc compiler with the -O3 optimization flag for both the generated kernels and the Netlib BLAS. For the IBM PowerPC experiments (figure 8) we used the xlc compiler with the -O3 optimization flag. The results show that the inner-loop optimizations used in Goto, MKL, and ESSL give a large benefit for the small matrices but only a small benefit for large matrices. On the other hand, for matrices that are too large to fit in cache, loop fusion gives BTO a speedup of roughly 100%, 15%, and 50% for gemver, dgemvt, and bicgkernel respectively compared to GotoBLAS and MKL on the Intel Xeon. On the IBM PowerPC, the speedup is 100%, 5%, and 30% on the three kernels compared to GotoBLAS and 100%, 30%, and 60% compared to ESSL.
To give a rough idea of the performance of combining the inner-loop optimizations and loop fusion, we provide the curves marked BTO-vec, which show the results of compiling the BTO-generated code using the -xP and -noalias flags of the underlying Intel compiler, and the -O4, -qaltivec, and -qarch=ppc970 flags of the IBM compiler. These flags enable vectorization of the inner-most loop. Combining vectorization with loop fusion results in Mflop ratings that are competitive with those of GotoBLAS and MKL for small matrices. It also gives a boost to the results for large matrix orders. BTO-vec achieves speedups of 120%, 25%, and 60% compared to the GotoBLAS and MKL on the Intel Xeon. The vectorization flags did not improve performance on the IBM PowerPC.
V. RELATED WORK
The MaJIC MATLAB compiler optimizes matrix expressions by re-associating them to reduce time complexity (Menon and Pingali, 1999) , but it does not perform loop fusion across multiple operations as we do in this paper. Adding re-association to our compiler is an area of future work. The telescoping languages project (Kennedy et al., 2005) optimizes MATLAB scripts by analyzing and transforming library calls using techniques such as procedure specialization, strength reduction, and vectorization. Similarly, the Broadway Compiler (Guyer and Lin, 2005) optimizes calls to libraries such as PLAPACK (Alpatov et al., 1997) . Our work differs in that we generate the linear algebra library routines instead of optimizing pre-existing libraries.
Several linear-algebra specific compilers in the literature generate implementations of a sparse matrix operation based on a specification written as a loop nest over a dense matrix (Bik et al., 1998; Pugh and Shpeisman, 1998) . Our compiler differs in that specifications are written at a higher level (matrix algebra) and we optimize sequences of operations for memory efficiency. Domain-specific compilers are gaining momentum and there are notable examples in other domains (Allam et al., 2006; Püschel et al., 2005; Quinlan et al., 2006) .
A closely related approach to domain-specific compilation is that of active libraries (Czarnecki et al., 2000) . The work presented here is a generalization of earlier work by the first author on the Matrix Template Library (MTL) (Siek, 1999) . There are other linear algebra libraries based on expression templates (Ahmed et al., 2000) , but none of them optimize across statements due to inherent limitations of the expression template approach (Veldhuizen, 1998) . The TaskGraph active library applies loop fusion across statements at run-time using a dataflow representation (Russell et al., 2006) . Our approach, in contrast, performs optimizations statically and separates the compiler into a graph rewriting engine and linear algebra database for extensibility.
The Hierarchically Tiled Arrays (HTA) library (Bikshandi et al., 2006) provides an abstraction that makes it easier to express loop blocking-similar abstraction may be found in (Siek, 1999 )-but HTA does not fully automate the implementation as we do here. Array libraries such as Blitz++ perform loop fusion to reduce memory traffic (Veldhuizen, 1998) . The Formal Linear Algebra Methods Environment (FLAME) (Bientinesi, 2006; Gunnels et al., 2001) facilitates the implementation and generation of correct linear algebra algorithms, but does not include an automatic optimization phase as in our work. The iterative and empirical compilation framework (iFKO) (Whaley and Whalley, 2005) automates the tuning of linear algebra operations, such as matrix multiplication, whereas our work tunes sequences of operations.
There is considerable literature on the optimization of arrays in languages such as HPF (Kennedy et al., 2007) . Our work differs in that it is specific to linear algebra and can handle a wide range of matrix storage formats.
VI. CONCLUSION AND FUTURE WORK
In this paper we presented a compilation framework for generating efficient implementations of arbitrary sequences of basic linear algebra operations. The compiler represents the operations in a dataflow graph and refines the graph, making implementation decisions and applying high-level optimizations. The compiler is domain specific, incorporating knowledge of linear algebra data-structures and algorithms into the compiler via an extensible database. Our performance results show that this approach is promising: the compiler generates code that is 25-120% faster than state-of-the-art BLAS implementations. In the future we plan to implement a hybrid analytic/empirical search of the optimization alternatives as discussed in section III-E. We plan to expand the linear algebra database to include more matrix formats, especially sparse matrix formats. We are looking at multi-core parallelization techniques and plan to incorporate those into the compiler.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for suggestions that improved this paper.
This work was supported by NASA AIST grant #NAG2-1646, DOE SciDAC grant #DE-FG02-04ER63870, NSF sponsorship of the National Center for Atmospheric Research, a grant from the IBM Shared University Research (SUR) program, and the Intel Corporation.
